Open AI Gym: Wrappers & Monitors

This article is the third and final installment in a three-part series focusing on the core components of the Gym library. Gym is a Python library developed and maintained by OpenAI whose purpose is to house a rich collection of environments for Reinforcement Learning (RL) experiments under a unified interface.

If you have only just begun your journey with the Gym framework, the official Gym documentation is a great place to start. However, the information it offers is limited. The articles in this series aim to expand on that information and provide a deeper understanding of each of the four components Gym offers. In this article, we focus on Wrappers & Monitors.

Article Contents

  1. Wrappers
  2. Monitors

Wrappers

Wrappers are a way to modularly extend an RL environment's functionality while keeping the original classes separate. For example, given an environment that produces a stream of observations, a wrapper can accumulate those observations in a buffer and output the agent's \(N\) most recent observations, all without editing the original class.

Typically, wrappers are "wrapped" around an existing environment and add extra functionality. Gym provides a framework for these situations using the Wrapper class.

Figure 1.1 Wrapper classes hierarchy

Wrappers can be found in two locations in the Gym GitHub repository. The first is the core.py file, which contains the main template wrappers, all of which are discussed in this section. The second is the wrappers folder, which houses a wide variety of utility wrappers for different environment requirements. Together, these sources highlight the following quick tips for creating wrappers:

  • Remember to call super().__init__(env) when overriding the wrapper's __init__() function.
  • Inner environments are accessible with self.unwrapped.
  • Previous environment layers are accessed using self.env.
  • The variables metadata, action_space, observation_space, reward_range, and spec are copied to self from the previous environment.
  • A wrapper class should override at least one of the following functions: __init__(self, env), step, reset, render, close, or seed.
  • A wrapped (layered) function typically takes its input from the previous layer (self.env) and/or the inner layer (self.unwrapped).

The Wrapper class inherits from the Env class and accepts a single parameter – the instance of the Env class to wrap. The Wrapper class has three child classes that allow filtration of the environment's core information: the ObservationWrapper, RewardWrapper, and ActionWrapper.

Each wrapper has a unique version of the reset() and step(action) functions found in the Env class. Both of these functions are explained in more detail in the Open AI Gym: Environments article. Furthermore, each child wrapper requires an additional class function, highlighted within their respective sections below.
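
To illustrate the general pattern before moving to the child wrappers, the minimal sketch below subclasses Wrapper directly and overrides both reset() and step() to count the steps taken in the current episode, forwarding every call to the previous layer via self.env. The class name EpisodeStepCounter is purely illustrative and not part of Gym.

  import gym

  class EpisodeStepCounter(gym.Wrapper):
    """A minimal Wrapper sketch that counts the steps taken in the current episode."""
    def __init__(self, env: gym.Env) -> None:
      super().__init__(env)
      self.steps = 0

    def reset(self, **kwargs):
      """Starts a new episode and resets the step counter."""
      self.steps = 0
      return self.env.reset(**kwargs)

    def step(self, action):
      """Forwards the action to the previous layer and counts the step."""
      self.steps += 1
      return self.env.step(action)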

ObservationWrapper

The ObservationWrapper is a wrapper that focuses on observations. It requires a single function, observation(obs), where the obs argument is a single observation from the wrapped environment. The observation is passed into the function, modified according to the developer's requirements, and then returned.
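
As a minimal sketch of this idea, the wrapper below casts each observation to float32 and rescales it by a constant factor. The class name ScaledObservationWrapper and the scale parameter are purely illustrative.

  import gym
  import numpy as np

  class ScaledObservationWrapper(gym.ObservationWrapper):
    """A minimal ObservationWrapper sketch that rescales every observation."""
    def __init__(self, env: gym.Env, scale: float = 1.0) -> None:
      super().__init__(env)
      self.scale = scale

    def observation(self, obs: np.ndarray) -> np.ndarray:
      # Every observation returned by reset() and step() passes through here.
      return np.asarray(obs, dtype=np.float32) * self.scale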

RewardWrapper

Like the ObservationWrapper, RewardWrapper requires a single function, reward(rew). However, this function focuses on agent rewards, not observations. The rew argument is a single reward value that gets updated within the function and then returned by it.
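
A minimal sketch of this idea clips every reward to the range [-1, 1]; the class name ClippedRewardWrapper is purely illustrative, and reward clipping is simply one common use of this wrapper rather than something the Gym library prescribes.

  import gym
  import numpy as np

  class ClippedRewardWrapper(gym.RewardWrapper):
    """A minimal RewardWrapper sketch that clips every reward to [-1, 1]."""
    def reward(self, rew: float) -> float:
      # Every reward returned by step() passes through here before reaching the agent.
      return float(np.clip(rew, -1.0, 1.0))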

ActionWrapper

The last child wrapper, ActionWrapper, focuses on the third critical component of RL algorithms: actions. Similar to the others, it requires a single function, action(act). The act argument is a single agent action that passes through the function, is updated, and then returned.

For instance, imagine a situation where we want to amend the stream of actions sent by the agent and change them so that with a probability of 10%, the current action is replaced with a random one. Using this approach, the agent will explore the environment more, assisting in solving the exploration/exploitation problem. In practice, the example would look like the following:

  import gym
  import random
  from typing import TypeVar

  Action = TypeVar('Action')

  class RandomActionWrapper(gym.ActionWrapper):
    """A basic ActionWrapper representation for including random actions."""
    def __init__(self, env: gym.Env, epsilon: float = 0.1) -> None:
      super().__init__(env)
      self.epsilon = epsilon
    
    def action(self, action: Action) -> Action:
      """
      Handles the agent action functionality. Accepts a current action 
      and returns a random action if a random value is less than self.epsilon, 
      otherwise, the current action is taken.
      """
      if random.random() < self.epsilon:
        return self.env.action_space.sample()
      return action
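
Using the wrapper is then just a matter of passing an existing environment instance to its constructor, for example (assuming the CartPole-v0 environment is available):

  env = RandomActionWrapper(gym.make("CartPole-v0"), epsilon=0.1)
  obs = env.reset()
  # Roughly 10% of the actions passed to step() are now replaced with random ones.
  obs, reward, done, info = env.step(0)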

Monitors

The Monitor class is implemented like a Wrapper class and is used to write information about the agent's performance into a file. Monitors are useful for reviewing an agent's life inside its environment. In total, there are eight parameters available for the Monitor class. Two are required and the remaining six are optional. The two required parameters are:

  • env – an Env class object denoting the type of environment to monitor.
  • directory – a string value containing the name of a directory, which does not need to exist yet, where the monitor information is stored.

The six optional arguments are as follows:

  • video_callable – accepts a function, None, or False. If a custom function is provided, it must take in the index of an episode and output a Boolean, indicating whether a video is recorded for the current episode (a simple custom schedule is sketched after this list). The default value is None, which assigns the variable to the monitor's capped_cubic_video_schedule(episode_id) function. The parameter can be set to False to disable video recording.
  • force – a Boolean value, false by default, that works in conjunction with the directory argument. If set to true, all existing training data within the given directory gets deleted, and a prefix of "openaigym" is applied to each file.
  • resume – a Boolean value that is set to false by default. If set to true, the training data already in the given directory is retained and merged with the new data.
  • write_upon_reset – a Boolean value set to false by default. When set to true, the monitor will write the manifest (JSON) file on each reset. Warning: this can be computationally expensive.
  • uid – a string value representing a unique id, used as part of the suffix for the JSON file. If the uid is None, one is generated automatically using os.getpid().
  • mode – a string value that is set to None by default. The parameter accepts [evaluation, training] as values to help distinguish between varying episodes when reviewing the results.
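
As mentioned above, video_callable can be given a custom schedule. The sketch below records a video for every tenth episode; the function name record_every_tenth is purely illustrative.

  def record_every_tenth(episode_id: int) -> bool:
    """Records a video only for episodes whose index is a multiple of ten."""
    return episode_id % 10 == 0

  env = gym.wrappers.Monitor(env, "recording", video_callable=record_every_tenth, force=True)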

Additionally, there are two components required for the Monitor class. The first is the FFmpeg utility for converting captured observations into an output video file. If the utility isn’t available, Monitor will raise an exception.

The second required component is a method for video recording, which allows the monitor to take screenshots of the window drawn by the environment. The recommended approach is to use Xvfb (X11 virtual framebuffer), a ‘virtual’ graphical display. Xvfb starts a virtual graphical display server and forces the program to draw inside it.

On Linux, a standard set of commands to install and run Xvfb is shown below:

  sudo apt install xvfb python-opengl ffmpeg
  xvfb-run -s "-screen 0 640x480x24" python filename.py

A Monitor instance can be created with the Gym library as follows:

  gym.wrappers.Monitor(env, "recording")
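
Putting it all together, a fuller sketch (assuming FFmpeg is installed and a display is available, for example via xvfb-run, and using the CartPole-v0 environment purely for illustration) records a single episode of a random agent and then finalises the output files by closing the environment:

  import gym

  env = gym.wrappers.Monitor(gym.make("CartPole-v0"), "recording", force=True)
  obs = env.reset()
  done = False
  total_reward = 0.0
  while not done:
    # Sample a random action; the monitor records the rendered frames and episode statistics.
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
  env.close()  # Flushes the monitor's result files and finalises the video.
  print(f"Total reward: {total_reward:.2f}")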