
Maze Environment#

We provide here a JAX JIT-able implementation of a 2D maze problem. The maze is a size-configurable 2D matrix where each cell represents either free space (white) or a wall (black).

The goal is for the agent (green) to reach the single target cell (red). It is a sparse reward problem, where the agent receives a reward of 0 at every step and a reward of 1 for reaching the target. The agent may choose to move one space up, right, down, or left ("N", "E", "S", "W"). If the way is blocked by a wall, the agent remains at the same position.

Each maze is randomly generated using a recursive division function. By default, a new maze, initial agent position and target position are generated each time the environment is reset.
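To give a feel for recursive division, here is a minimal sketch of the textbook version of the algorithm in plain Python with NumPy. It is not the generator used by this environment and is not JIT-able. The function name `recursive_division`, the `seed` parameter, and the convention of placing walls on odd indices and gaps on even indices are all assumptions made for illustration.

```python
import random

import numpy as np


def recursive_division(num_rows, num_cols, seed=0):
    """Return a bool array where True marks a wall (illustrative sketch)."""
    rng = random.Random(seed)
    walls = np.zeros((num_rows, num_cols), dtype=bool)

    def divide(r0, r1, c0, c1):
        # The region is the inclusive rectangle [r0, r1] x [c0, c1].
        height, width = r1 - r0 + 1, c1 - c0 + 1
        if height < 3 or width < 3:
            return  # too small to split further
        if height > width or (height == width and rng.random() < 0.5):
            # Horizontal wall on an odd row, with one gap on an even column.
            wall_row = rng.randrange(r0 + 1, r1, 2)
            gap_col = rng.randrange(c0, c1 + 1, 2)
            walls[wall_row, c0:c1 + 1] = True
            walls[wall_row, gap_col] = False
            divide(r0, wall_row - 1, c0, c1)
            divide(wall_row + 1, r1, c0, c1)
        else:
            # Vertical wall on an odd column, with one gap on an even row.
            wall_col = rng.randrange(c0 + 1, c1, 2)
            gap_row = rng.randrange(r0, r1 + 1, 2)
            walls[r0:r1 + 1, wall_col] = True
            walls[gap_row, wall_col] = False
            divide(r0, r1, c0, wall_col - 1)
            divide(r0, r1, wall_col + 1, c1)

    divide(0, num_rows - 1, 0, num_cols - 1)
    return walls


maze = recursive_division(10, 10, seed=0)
```

Because walls only ever land on odd rows or columns and gaps on even ones, a later wall can never close an earlier gap, so every free cell stays reachable in this sketch.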


As an observation, the agent has access to the current maze configuration in the array named walls. It also has access to its current position agent_position, the target's target_position, the number of steps step_count elapsed in the current episode and the action mask action_mask.

  • agent_position: Position(row, col) (int32) each of shape (), agent position in the maze.

  • target_position: Position(row, col) (int32) each of shape (), target position in the maze.

  • walls: jax array (bool) of shape (num_rows, num_cols), indicates whether a grid cell is a wall.

  • step_count: jax array (int32) of shape (), number of steps elapsed in the current episode.

  • action_mask: jax array (bool) of shape (4,), binary values denoting whether each action is possible.
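The fields listed above can be pictured as nested named tuples. The following is a sketch of that layout, not the library's own class definitions: the class names `Position` and `Observation` mirror the field names in the list, and the 3x3 example values are made up for illustration.

```python
from typing import NamedTuple

import jax.numpy as jnp


class Position(NamedTuple):
    row: jnp.ndarray  # int32, shape ()
    col: jnp.ndarray  # int32, shape ()


class Observation(NamedTuple):
    agent_position: Position
    target_position: Position
    walls: jnp.ndarray        # bool, shape (num_rows, num_cols)
    step_count: jnp.ndarray   # int32, shape ()
    action_mask: jnp.ndarray  # bool, shape (4,)


# Hypothetical observation: empty 3x3 maze, agent in the top-left corner.
example = Observation(
    agent_position=Position(
        row=jnp.array(0, dtype=jnp.int32), col=jnp.array(0, dtype=jnp.int32)
    ),
    target_position=Position(
        row=jnp.array(2, dtype=jnp.int32), col=jnp.array(2, dtype=jnp.int32)
    ),
    walls=jnp.zeros((3, 3), dtype=bool),
    step_count=jnp.array(0, dtype=jnp.int32),
    # Assumes moves that would leave the grid (up and left here) are masked out.
    action_mask=jnp.array([False, True, True, False]),
)
```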

An example 5x5 walls array from an observation is shown below. 1 represents a wall, and 0 represents free space.

[0, 1, 0, 0, 0],
[0, 1, 0, 1, 1],
[0, 1, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0]


The action space is a DiscreteArray of integer values in the range of [0, 3]. I.e. the agent can take one of four actions: up (0), right (1), down (2), or left (3). If an invalid action is taken, or an action is blocked by a wall, a no-op is performed and the agent's position remains unchanged.
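The no-op rule and the action mask can be sketched in JAX as follows. This is an illustrative sketch rather than the environment's actual code: the helper names `is_free`, `move`, and `action_mask`, the `MOVES` table, and the choice to treat moves off the grid as blocked are assumptions. The `WALLS` array reuses the 5x5 example from this page.

```python
import jax
import jax.numpy as jnp

# (row, col) offsets for up (0), right (1), down (2), left (3).
MOVES = jnp.array([[-1, 0], [0, 1], [1, 0], [0, -1]], dtype=jnp.int32)

WALLS = jnp.array(
    [
        [0, 1, 0, 0, 0],
        [0, 1, 0, 1, 1],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 0, 0],
    ],
    dtype=bool,
)


def is_free(walls, position):
    """True if `position` lies inside the grid and is not a wall."""
    num_rows, num_cols = walls.shape
    row, col = position[0], position[1]
    in_bounds = (row >= 0) & (row < num_rows) & (col >= 0) & (col < num_cols)
    # Clip before indexing so the lookup itself is always in range.
    safe_row = jnp.clip(row, 0, num_rows - 1)
    safe_col = jnp.clip(col, 0, num_cols - 1)
    return in_bounds & ~walls[safe_row, safe_col]


@jax.jit
def move(walls, position, action):
    """Apply `action`; invalid or blocked actions leave the position unchanged."""
    valid_action = (action >= 0) & (action < 4)
    candidate = position + MOVES[jnp.clip(action, 0, 3)]
    allowed = valid_action & is_free(walls, candidate)
    return jnp.where(allowed, candidate, position)


@jax.jit
def action_mask(walls, position):
    """Bool array of shape (4,): which of the four moves are possible."""
    candidates = position + MOVES  # shape (4, 2)
    return jax.vmap(lambda p: is_free(walls, p))(candidates)


start = jnp.array([0, 0], dtype=jnp.int32)
after_down = move(WALLS, start, 2)   # free cell below, so the agent moves
after_right = move(WALLS, start, 1)  # wall to the right, so this is a no-op
mask = action_mask(WALLS, start)
```

Using `jnp.where` instead of a Python `if` keeps both functions traceable, which is what lets them run under `jax.jit`.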


Maze is a sparse reward problem, where the agent receives a reward of 0 at every step and a reward of 1 for reaching the target position. An episode ends when the agent reaches the target position, or after a set number of steps (by default, this is twice the number of cells in the maze, i.e. step_limit=2*num_rows*num_cols).
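The reward and termination rule can be written down in a few lines. This is a sketch under assumptions: the helper names `default_step_limit` and `reward_and_done` are hypothetical, positions are passed as length-2 arrays, and the comparison `step_count >= step_limit` is one plausible reading of "after a set number of steps".

```python
import jax
import jax.numpy as jnp


def default_step_limit(num_rows, num_cols):
    # Default described above: twice the number of cells in the maze.
    return 2 * num_rows * num_cols


@jax.jit
def reward_and_done(agent_position, target_position, step_count, step_limit):
    """Sparse reward: 1 on reaching the target, 0 otherwise."""
    reached = jnp.all(agent_position == target_position)
    reward = jnp.where(reached, 1.0, 0.0)
    # Assumed convention: the episode also ends once the limit is hit.
    done = reached | (step_count >= step_limit)
    return reward, done


limit = default_step_limit(10, 10)
target = jnp.array([2, 3], dtype=jnp.int32)
reward_hit, done_hit = reward_and_done(target, target, 5, limit)
reward_miss, done_miss = reward_and_done(
    jnp.array([0, 0], dtype=jnp.int32), target, 5, limit
)
```

For a maze with 10 rows and 10 columns, the default limit works out to 200 steps.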

Registered Versions 📖#

  • Maze-v0, maze with 10 rows and 10 cols.