Skip to content

Multi Agent Capacitated Vehicle Routing Problem - MultiCVRP Environment#

We provide here a Jax JIT-able implementation of the multi-agent capacitated vehicle routing problem (MultiCVRP) which is specified in MVRPSTW.

This environment introduces the problem of routing multiple agents in a coordinated manner, specifically in the context of collecting items from various locations. Each agent controls one vehicle. The problem, called the multi-agent capacitated vehicle routing problem (MultiCVRP), entails directing a group of agents to different locations on a map. They need to collectively go to each node and return items to the depot location. To make the problem a bit more realistic we consider the multi-vehicle routing problem with soft time windows (MVRPSTW). In this formulation, each location on the map also has a soft time window in which the items must be collected. If the items are collected outside this window a penalty is provided to the agents.

A new problem instance is generated by resetting the environment. The problem instance contains coordinates for each node sampled from a uniform distribution inside the map boundries, and each node (except for depot) has a specific demand which is an integer value sampled from a uniform distribution between 1 and the maximum demand. The number of nodes with demand is a parameter of the environment.

Observation#

Each agent receives information on the coordinates, demands, time windows and penalty coefficients of all the customer nodes. Futhermore the agents receive positions, local times and vehicle capacity information on all vehicles. Lastly an action mask is also provided to each agent.

  • node_coordinates: jax array (float32) of shape (num_vehicles, num_customers + 1, 2), shows an array of the coordinates of each customer node and the depot node.
  • node_demands: jax array (int16) of shape (num_vehicles, num_customers + 1,), shows an array of the demands of each city node (and depot node where the demand is set to 0).
  • node_time_windows: jax array (float32) of shape (num_vehicles, num_customers + 1, 2), shows an array of the early and late time cutoffs for each customer.
  • node_penalty_coefs: jax array (float32) of shape (num_vehicles, num_customers + 1, 2), shows the early and late penalty coefficients for arriving early or late at a customer's location.
  • other_vehicles_position: jax array (int16) of shape (num_vehicles, num_vehicles - 1), shows the positions of all other vehicles.
  • other_vehicles_local_times: jax array (float32) of shape (num_vehicles, num_vehicles - 1), shows the local times of all other vehicles.
  • other_vehicles_capacities: jax array (int16) of shape (num_vehicles, num_vehicles - 1), shows the capacities of all other vehicles.
  • vehicle_position: jax array (int16) of shape (num_vehicles), shows the positions of the vehicles controlled by the agents.
  • vehicle_local_time: jax array (float32) of shape (num_vehicles), shows the local times of the vehicles controlled by the agents.
  • vehicle_capacity: jax array (int16) of shape (num_vehicles), shows the capacity of the vehicles controlled by the agents.
  • action_mask: jax array (bool) of shape (num_vehicles, num_customers + 1,), denoting which actions are possible (True) and which are not (False).

Action#

Each agent's action space is a BoundedArray of integer values in the range of [0, num_customers]. An action is the index of the next node to visit, and an action value of 0 corresponds to visiting the depot.

Reward#

  • Dense: The reward is equal to the sum of negative distances of the current location and next location of all the vehicles. Time penalities are added if the agents arrived early or late to specific customers. If the max step limit is reached, the episode ends with a large negative reward which is equal to the maximum negative distance reward that can be incurred.
  • Sparse: The reward is 0 at every step but the last, where the reward is the negative of the length of the path chosen by all the agents combined. Time penalities are added if the agents arrived early or late to specific customers. If the max step limit is reached, the episode ends with a large negative reward which is equal to the maximum negative distance reward that can be incurred.

Registered Versions 📖#

  • MultiCVRP-v0: MultiCVRP problem with 20 customers (randomly generated), maximum capacity of 20, and maximum demand of 10 with two vehicles.