Applied Reinforcement Learning: Playing Doom with TF-Agents and PPO

In recent years, enormous advances in Reinforcement Learning (RL) have been showcased by DeepMind, OpenAI and others. Their AIs achieve super-human or near-human capabilities in various games, including Atari games, the Chinese board game Go, Dota 2 and StarCraft.

At the same time, developing Deep Learning solutions for Supervised Learning has become easier and easier, with popular frameworks like TensorFlow available to the public. With the new TF-Agents framework, it now also becomes more straightforward to develop Reinforcement Learning solutions with TensorFlow.

In this post, we use TF-Agents to train a neural network agent to play a simple scenario of Doom. We will present the most relevant code parts here; have a look at our GitHub repository for the full implementation and additional required files.

Since this guide focuses on the usage of TF-Agents' high-level APIs, we will not go deep into the details of reinforcement learning and the algorithms used. If you are new to RL, please check out this awesome series by Arthur Juliani, who gives a great introduction from Q-learning up to A3C. For more insights into Proximal Policy Optimization (PPO), the OpenAI webpage is a good starting point.

What is TF-Agents?

TF-Agents is TensorFlow's new framework to assist in developing RL use cases. For RL to work properly, often all the finest details have to be considered, which can make RL hard to implement on your own. With TF-Agents, all required parts are already implemented for you, so you can concentrate on your use case and the optimization of hyperparameters.

At the moment, TF-Agents is still in an early phase of development but already implements several state-of-the-art RL algorithms.

What is Proximal Policy Optimization (PPO)?

PPO is a class of RL algorithms that have become the default choice for many use cases. According to OpenAI, they perform “comparably or better than state-of-the-art approaches while being much simpler to implement and tune”. Since PPO is a model-free, on-policy algorithm that can be used for both continuous and discrete action spaces, a great many use cases can be tackled with it.
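
At the core of PPO is a clipped surrogate objective that limits how far a single update can move the policy away from the previous one. Roughly (see the linked introduction for the full derivation), with r_t(θ) denoting the probability ratio between the new and the old policy and Â_t an advantage estimate, PPO maximizes

L^{CLIP}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\big]

The clipping range ε corresponds to the importance_ratio_clipping parameter (0.2) that we pass to TF-Agents' PPOAgent later in this post.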

For details on PPO and how it works, have a look at this OpenAI introduction.

Scenario: Doom Basic

To let our agent play Doom, we utilize the ViZDoom project, which aims to allow “developing AI bots that play Doom using only the visual information”. ViZDoom itself is based on the ZDoom project, which makes Doom playable on modern PCs.

ViZDoom offers several preconfigured scenarios, ranging from corridors with several monsters shooting at the player to labyrinths where the player must find their way to a target room.

To start off easy, we want our agent to learn to handle the “basic” scenario. The map consists of a rectangle where the player is on one side and a monster is spawned at a random position on the other side. The player can only maneuver left and right and fire their weapon. When the monster is hit, the episode (one match) is finished. The player gets a reward of +101 points for killing the monster and -5 points if the monster is not killed within 300 time-steps. Additionally, a reward of -1 is received for every time-step the monster is alive.
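
To get a feeling for the scenario and its reward structure, you can run it with random actions directly through the ViZDoom API once ViZDoom is installed (see below). The following is a minimal sketch that assumes basic.cfg and basic.wad are located in the working directory:

import random
from vizdoom import DoomGame

game = DoomGame()
game.load_config("basic.cfg")
game.init()

num_actions = game.get_available_buttons_size()  # move left, move right, attack
for episode in range(5):
    game.new_episode()
    while not game.is_episode_finished():
        # encode a random action as the one-hot button list expected by ViZDoom
        one_hot = [0] * num_actions
        one_hot[random.randrange(num_actions)] = 1
        game.make_action(one_hot)
    print("Episode", episode, "total reward:", game.get_total_reward())
game.close()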

The following video shows a fully trained agent performing 10 episodes of this scenario. (Sometimes the muzzle flash is not visible due to the reduced frame rate of the GIF.)

Prerequisites

This tutorial assumes you have Python 3.x installed. While most of the code will probably also work with Python 2, it might require some adaptations.

Please note: At the time of writing, TensorFlow 2.0 is not yet released and only available as version 2.0.0b1. Furthermore, since TF-Agents is not released yet, we install it from source. (TF-Agents also provides nightly pip packages, but they are currently not updated regularly.)

Installations

Please install the following packages.

TensorFlow

TensorFlow can be installed for CPU with pip3 install -U tensorflow==2.0.0b1. To install TensorFlow with GPU support, have a look at the requirements and run pip3 install -U tensorflow-gpu==2.0.0b1.

TF-Agents

Clone the TF-Agents repository from GitHub and install it via pip install -e <directory where you cloned tf-agents to>.

ViZDoom

pip3 install -U vizdoom

OpenCV

apt install python3-opencv

ImageIO

pip3 install -U imageio
apt install ffmpeg
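
To verify that everything is set up correctly, a quick import check can be run (just a sanity-check sketch, not part of the tutorial code):

# sanity check: all required packages should import without errors
import tensorflow as tf
import tf_agents
import vizdoom
import cv2
import imageio

print("TensorFlow version:", tf.__version__)  # should show the 2.0.0 beta release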

Implementation

To train an RL agent with TF-Agents, we need to provide an environment for the game and to configure the RL algorithm. For popular environment sets like Gym, Atari or MuJoCo, TF-Agents offers predefined suites that can be used to load those environments. For Doom, we have to write our own environment.
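
For comparison, loading one of these predefined environments takes only a single call. The following snippet (purely an illustration, it is not used for Doom) loads a Gym environment through the suite_gym module:

from tf_agents.environments import suite_gym

# loads an OpenAI Gym environment and wraps it as a TF-Agents PyEnvironment
env = suite_gym.load('CartPole-v0')
print(env.action_spec())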

Doom Environment

The DoomEnvironment will be a small wrapper around ViZDoom’s DoomGame class. This wrapper configures the game with the desired scenario and maps the TF-Agents environment API to the ViZDoom API.

To create an instance of DoomGame, and configure it for our basic scenario, the files basic.cfg and basic.wad are required. The following code configures the instance.

from vizdoom import DoomGame

def configure_doom(config_name="basic.cfg"):
    # load the scenario configuration (which references basic.wad) and start the game
    game = DoomGame()
    game.load_config(config_name)
    game.init()
    return game

In the constructor of the DoomEnvironment class, the game is loaded and the number of available actions is saved (other scenarios might have a different number of actions). Furthermore, an action_spec and an observation_spec are declared. These define the format of the actions and observations provided by this environment.

The remainder of the code maps the TF-Agents API to the ViZDoom API:

import numpy as np
import cv2

from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step


class DoomEnvironment(py_environment.PyEnvironment):

    def __init__(self):
        super().__init__()

        self._game = configure_doom()
        self._num_actions = self._game.get_available_buttons_size()

        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=self._num_actions - 1, name='action')
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(84, 84, 3), dtype=np.float32, minimum=0, maximum=1, name='observation')

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        self._game.new_episode()
        return time_step.restart(self.get_screen_buffer_preprocessed())

    def _step(self, action):
        if self._game.is_episode_finished():
            # The last action ended the episode. Ignore the current action and start a new episode.
            return self.reset()

        # construct one hot encoded action as required by ViZDoom
        one_hot = [0] * self._num_actions
        one_hot[action] = 1

        # execute action and receive reward
        reward = self._game.make_action(one_hot)

        # return transition depending on game state
        if self._game.is_episode_finished():
            return time_step.termination(self.get_screen_buffer_preprocessed(), reward)
        else:
            return time_step.transition(self.get_screen_buffer_preprocessed(), reward)

    def render(self, mode='rgb_array'):
        """ Return image for rendering. """
        return self.get_screen_buffer_frame()

    def get_screen_buffer_preprocessed(self):
        """
        Preprocess frame for agent by:
        - cutout interesting square part of screen
        - downsample cutout to 84x84 (same as used for atari games)
        - normalize images to interval [0,1]
        """
        frame = self.get_screen_buffer_frame()
        cutout = frame[10:-10, 30:-30]
        resized = cv2.resize(cutout, (84, 84))
        return np.divide(resized, 255, dtype=np.float32)

    def get_screen_buffer_frame(self):
        """ Get current screen buffer or an empty screen buffer if episode is finished"""
        if self._game.is_episode_finished():
            return np.zeros((120, 160, 3), dtype=np.float32)
        else:
            return self._game.get_state().screen_buffer

Defining the Agent's Neural Networks

To train an agent with PPO, an actor network and a value network are required. TF-Agents offers classes allowing an easy configuration of the neural networks we want to use. The following code shows how these are configured.

import tensorflow as tf
from tf_agents.networks.actor_distribution_rnn_network import ActorDistributionRnnNetwork
from tf_agents.networks.value_rnn_network import ValueRnnNetwork


def create_networks(observation_spec, action_spec):
    actor_net = ActorDistributionRnnNetwork(
        observation_spec,
        action_spec,
        conv_layer_params=[(16, 8, 4), (32, 4, 2)],
        input_fc_layer_params=(256,),
        lstm_size=(256,),
        output_fc_layer_params=(128,),
        activation_fn=tf.nn.elu)
    value_net = ValueRnnNetwork(
        observation_spec,
        conv_layer_params=[(16, 8, 4), (32, 4, 2)],
        input_fc_layer_params=(256,),
        lstm_size=(256,),
        output_fc_layer_params=(128,),
        activation_fn=tf.nn.elu)

    return actor_net, value_net

Our two networks have mostly the same structure: two convolutional layers (16 filters of size 8 with stride 4, followed by 32 filters of size 4 with stride 2), a fully connected (FC) layer with 256 units, an LSTM with 256 units and an output FC layer with 128 units, all using the ELU activation function.

On top of these layers, the actor distribution network adds an FC layer with the number of actions and the value network adds an FC layer with one cell to calculate the value of the observation.

Training with PPO

To train the agent with TF-Agents' PPO agent, we have to create an object of the class PPOAgent and provide it with the specifications of the time steps (observations, reward, ...) and the allowed actions. Furthermore, we also provide the optimizer (e.g. AdamOptimizer) that should be used during training. To reduce the variance during training, we combine multiple DoomEnvironments with TF-Agents' ParallelPyEnvironment. If you run out of GPU memory, you might need to decrease the number of parallel environments.

The PPOAgent requires additional hyperparameters that might be tuned to improve training performance further.

import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import parallel_py_environment, tf_py_environment

# num_parallel_environments, learning_rate and num_epochs are hyperparameters (see our repository)
eval_tf_env = tf_py_environment.TFPyEnvironment(DoomEnvironment())
tf_env = tf_py_environment.TFPyEnvironment(
    parallel_py_environment.ParallelPyEnvironment([DoomEnvironment] * num_parallel_environments))

actor_net, value_net = create_networks(tf_env.observation_spec(), tf_env.action_spec())

global_step = tf.compat.v1.train.get_or_create_global_step()
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate, epsilon=1e-5)

tf_agent = ppo_agent.PPOAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    optimizer,
    actor_net,
    value_net,
    num_epochs=num_epochs,
    train_step_counter=global_step,
    discount_factor=0.995,
    gradient_clipping=0.5,
    entropy_regularization=1e-2,
    importance_ratio_clipping=0.2,
    use_gae=True,
    use_td_lambda_return=True
)
tf_agent.initialize()

To collect the training samples, a TFUniformReplayBuffer is created. To run the environment and fill up the replay buffer, an instance of DynamicEpisodeDriver is created. It uses the agent's collect_policy to interact with the environment.

A training iteration then consists of collecting episodes with the driver, training the agent on the gathered trajectories and clearing the replay buffer:

from tf_agents.drivers import dynamic_episode_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    tf_agent.collect_data_spec, batch_size=num_parallel_environments, max_length=replay_buffer_capacity)
collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env, tf_agent.collect_policy, observers=[replay_buffer.add_batch],
    num_episodes=collect_episodes_per_iteration)

for _ in range(num_iterations):  # num_iterations is a hyperparameter (see our repository)
    collect_driver.run()
    trajectories = replay_buffer.gather_all()
    tf_agent.train(experience=trajectories)
    replay_buffer.clear()

The full sample code can be found in our repository.

Logging the Training's Progress

To get better insights into the training's progress, we can use TensorBoard summaries and create videos of episodes performed by our agent. For the full working code with TensorBoard logging and video creation, please refer to the sample in our repository.
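
As a rough idea of how such logging can look, the following sketch (not the exact code from our repository; num_eval_episodes and the log directory are placeholders) runs the agent's policy on the evaluation environment and writes the resulting metrics as TensorBoard scalars:

import tensorflow as tf
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.metrics import tf_metrics

eval_summary_writer = tf.compat.v2.summary.create_file_writer('eval')  # log directory is arbitrary

avg_return = tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes)
avg_length = tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)

# run the agent's policy for a few episodes; the metrics act as observers of the driver
eval_driver = dynamic_episode_driver.DynamicEpisodeDriver(
    eval_tf_env, tf_agent.policy, observers=[avg_return, avg_length],
    num_episodes=num_eval_episodes)
eval_driver.run()

with eval_summary_writer.as_default():
    tf.compat.v2.summary.scalar('AverageReturn', avg_return.result(), step=global_step)
    tf.compat.v2.summary.scalar('AverageEpisodeLength', avg_length.result(), step=global_step)

Running such an evaluation every few training iterations yields curves like the ones shown in the Results section.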

To create videos of the playing agent, we can use the render method of our DoomEnvironment. However, to be able to let the agent process a full episode with the LSTM, we need to utilize the wrapped TFEnvironment. Fortunately, since the TFEnvironment is only a wrapper, we can use both in combination to render our videos with the following snippet.

import imageio
from tf_agents.environments.py_environment import PyEnvironment
from tf_agents.environments.tf_environment import TFEnvironment
from tf_agents.policies import tf_policy
from tf_agents.trajectories.policy_step import PolicyStep


def create_video(py_environment: PyEnvironment, tf_environment: TFEnvironment, policy: tf_policy.Base,
                 num_episodes=10, video_filename='imageio.mp4'):
    with imageio.get_writer(video_filename, fps=60) as video:
        for episode in range(num_episodes):
            time_step = tf_environment.reset()
            state = policy.get_initial_state(tf_environment.batch_size)

            video.append_data(py_environment.render())
            while not time_step.is_last():
                policy_step: PolicyStep = policy.action(time_step, state)
                state = policy_step.state
                time_step = tf_environment.step(policy_step.action)
                video.append_data(py_environment.render())

In the create_video() function, we first reset the environment and get the initial state for the LSTM. Afterwards, the policy is used to calculate the policy_step based on the current time_step (the observation). From the policy_step, we get the new internal LSTM state and the next action to take. The latter is then executed in the environment to retrieve the next time_step. For every step in the environment, we call the render() method of the PyEnvironment and append the image to the video.
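
To call the function, we need a reference to both the plain PyEnvironment and its TF wrapper. A small usage sketch (the variable names and the output filename are arbitrary) could look like this:

# keep a reference to the underlying PyEnvironment so we can render frames from it
eval_py_env = DoomEnvironment()
eval_tf_env = tf_py_environment.TFPyEnvironment(eval_py_env)

create_video(eval_py_env, eval_tf_env, tf_agent.policy,
             num_episodes=10, video_filename='trained_agent.mp4')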

Results

The following results were created with the extended training script available in our repository. Training finished after about 15 hours on a GeForce RTX 2080 Ti with 11GB of VRAM.

To view the results in TensorBoard, start it from the command line via tensorboard --logdir=<root_dir>, where <root_dir> is the directory specified when training the agent.

TensorBoard Graphs

In the TensorBoard dashboard, several graphs give insight into the training progress. We can get a good overview of the agent's performance from the graphs in the Metrics section as shown in the following figure. In our case, the blue line visualizes the performance during training and the orange line the performance during evaluations.

In general, the average return should increase towards 100 and the average episode length should decrease towards 0. However, since most of the time the agent has to move to kill the monster, the episode length will be higher than 0 and the reward will be correspondingly lower.

Learning Progress in Videos

The generated videos can give a good impression of the agent's current behaviour and strategy. The following clips were taken from a training run and show the progression and improvement of the agent over time. Please note that every training run behaves a bit differently due to the random initialization and random environment behavior.

After 500 steps: The agent has a tendency to move left and only sometimes steers towards the monster. Furthermore, the agent fires constantly, indicating it has learned that firing is the action that can yield an instant positive reward.
After 25000 steps: The agent moves directly towards the monster but does not always line up with it right away. Sometimes it fires multiple times.
After 218000 steps: The agent moves directly towards the monster and fires only once, thus achieving maximum efficiency.

Conclusion

In this tutorial, we showed how to create a custom environment for TF-Agents and how to use it to train a neural network agent with Proximal Policy Optimization in just about 250 lines of code. While TF-Agents is still in an early phase of development, it already makes powerful reinforcement learning algorithms simple to use.

Feel free to get in touch if you have any comments or feedback. We hope this tutorial helps you get an easy start into reinforcement learning with TF-Agents.