Dataset Cards - OG MARL
3m - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v1) | SMAC V1, from OxWhiRL | 3 | Discrete | [30] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 250k transitions. This procedure was repeated 4 times and the resulting data was combined.
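The collection procedure can be summarised with a short sketch. This is not the OG-MARL collection code: `env`, `policy`, `sample_action` and `greedy_action` are illustrative placeholders for a parallel multi-agent environment and a trained QMIX policy.

```python
import numpy as np

EPSILON = 0.05             # exploration rate used during data collection
TARGET_TRANSITIONS = 250_000
NUM_RUNS = 4               # independent runs whose data is combined

def collect_run(env, policy, rng):
    """Roll out an epsilon-greedy version of a trained policy until roughly
    TARGET_TRANSITIONS transitions have been recorded."""
    transitions = []
    while len(transitions) < TARGET_TRANSITIONS:
        observations = env.reset()
        done = False
        while not done:
            actions = {}
            for agent_id, obs in observations.items():
                if rng.random() < EPSILON:
                    actions[agent_id] = env.sample_action(agent_id)          # explore
                else:
                    actions[agent_id] = policy.greedy_action(agent_id, obs)  # exploit
            next_observations, rewards, terminals = env.step(actions)
            transitions.append(
                (observations, actions, rewards, next_observations, terminals)
            )
            done = all(terminals.values())
            observations = next_observations
    return transitions

def build_dataset(env, policy):
    # Repeat the collection and concatenate the runs into one dataset of
    # roughly NUM_RUNS * TARGET_TRANSITIONS transitions.
    dataset = []
    for seed in range(NUM_RUNS):
        dataset.extend(collect_run(env, policy, np.random.default_rng(seed)))
    return dataset
```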
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Poor | 4.69 ± 2.14 | 0.00 | 20.00 | 997370 | 48779 | 0.81 |
Medium | 9.96 ± 6.06 | 0.00 | 20.00 | 995313 | 41619 | 0.85 |
Good | 16.49 ± 5.92 | 0.00 | 20.00 | 996366 | 43559 | 0.80 |
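For reference, a minimal sketch of how the first six columns of these summary tables (here and in the cards below) could be reproduced from per-episode data. The function and its inputs are illustrative, not the code used to produce the tables; Joint SACo is reported separately following the OG-MARL methodology.

```python
import numpy as np

def summarise(episode_returns, episode_lengths):
    """Compute the summary-table columns from per-trajectory data.

    `episode_returns` and `episode_lengths` hold one entry per trajectory
    in the dataset.
    """
    returns = np.asarray(episode_returns, dtype=float)
    lengths = np.asarray(episode_lengths, dtype=int)
    return {
        "episode_return_mean": f"{returns.mean():.2f} ± {returns.std():.2f}",
        "min_return": float(returns.min()),
        "max_return": float(returns.max()),
        "transitions": int(lengths.sum()),    # total environment steps stored
        "trajectories": len(returns),         # number of episodes in the dataset
        # Joint SACo (joint state-action coverage) is computed from the stored
        # joint state-action pairs and is not reproduced in this sketch.
    }
```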
8m - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v1) | SMAC V1, from OxWhiRL | 8 | Discrete | [80] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 250k transitions. This procedure was repeated 4 times and the resulting data was combined.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Poor | 5.28 ± 0.56 | 0.00 | 7.62 | 995144 | 20629 | 0.64 |
Medium | 10.14 ± 3.34 | 0.00 | 20.00 | 996501 | 39208 | 0.96 |
Good | 16.86 ± 4.33 | 0.19 | 20.00 | 997785 | 30638 | 0.86 |
5m_vs_6m - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v1) | SMAC V1, from OxWhiRL | 5 | Discrete | [55] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 250k transitions. This procedure was repeated 4 times and the resulting data was combined.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Poor | 7.45 ± 1.48 | 0.00 | 20.00 | 934505 | 45501 | 0.85 |
Medium | 12.62 ± 5.06 | 0.00 | 20.00 | 996856 | 39284 | 0.87 |
Good | 16.58 ± 4.69 | 0.00 | 20.00 | 996727 | 36311 | 0.84 |
2s3z - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v1) | SMAC V1, from OxWhiRL | 5 | Discrete | [80] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 250k transitions. This procedure was repeated 4 times and the resulting data was combined.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Poor | 6.88 ± 2.06 | 0.00 | 13.61 | 996418 | 9942 | 0.96 |
Medium | 12.57 ± 3.14 | 0.00 | 21.30 | 996256 | 18605 | 0.98 |
Good | 18.32 ± 2.95 | 0.00 | 21.62 | 995829 | 18616 | 0.98 |
3s5z_vs_3s6z - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v1) | SMAC V1, from OxWhiRL | 8 | Discrete | [136] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 250k transitions. This procedure was repeated 4 times and the resulting data was combined.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Poor | 5.90 ± 2.22 | 0.19 | 11.93 | 996474 | 17807 | 0.96 |
Medium | 10.69 ± 1.49 | 0.00 | 17.67 | 996699 | 18866 | 0.97 |
Good | 16.56 ± 3.72 | 6.30 | 24.46 | 996528 | 7315 | 0.97 |
terran_5_vs_5 - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v2) | SMAC V2, from OxWhiRL | 5 | Discrete | [82] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 250k transitions. This procedure was repeated 4 times and the resulting data was combined.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Replay | 10.05 ± 5.84 | 0.00 | 36.34 | 898164 | 17958 | 1.00 |
Random | 2.43 ± 1.73 | 0.00 | 16.18 | 1500000 | 37874 | 0.91 |
terran_10_vs_10 - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v2) | SMAC V2, from OxWhiRL | 10 | Discrete | [162] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 1M transitions.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Replay | 6.32 ± 3.62 | 0.00 | 23.01 | 749850 | 13588 | 1.00 |
zerg_5_vs_5 - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
SMAC (v2) | SMAC V2, from OxWhiRL | 5 | Discrete | [82] | Dense |
Generation procedure for each dataset
A QMIX system was trained to a target level of performance. The learnt policy was then rolled out with an epsilon-greedy policy (eps=0.05) to collect approximately 1M transitions.
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Replay | 7.34 ± 3.60 | 0.00 | 24.00 | 863281 | 23294 | 1.00 |
2halfcheetah - Download
Metadata
Environment name | Version | Agents | Action type | Observation size | Reward type |
---|---|---|---|---|---|
MAMuJoCo | V1.1, Mujoco v210 | 2 | Continuous | [13] | Dense |
Generation procedure for each dataset
A MATD3 system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, with Gaussian noise (standard deviation 0.2) added to the selected actions. This procedure was repeated 4 times and the resulting data was combined.
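The only difference from the SMAC cards above is that, because the actions are continuous, exploration is injected as Gaussian noise rather than epsilon-greedy action selection. A minimal sketch of that perturbation step, assuming a deterministic per-agent policy and actions bounded in [-1, 1] (both assumptions, not the OG-MARL implementation):

```python
import numpy as np

NOISE_STD = 0.2  # standard deviation of the exploration noise

def noisy_continuous_actions(policy, observations, rng):
    """Perturb the deterministic MATD3 actions with Gaussian noise.

    `policy.action(agent_id, obs)` and the [-1, 1] action bounds are
    illustrative placeholders.
    """
    actions = {}
    for agent_id, obs in observations.items():
        clean_action = policy.action(agent_id, obs)
        noise = rng.normal(0.0, NOISE_STD, size=clean_action.shape)
        actions[agent_id] = np.clip(clean_action + noise, -1.0, 1.0)
    return actions
```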
Summary statistics
Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
---|---|---|---|---|---|---|
Poor | 400.45 ± 333.96 | -191.49 | 905.03 | 1000000 | 1000 | 1.00 |
Medium | 1485.00 ± 469.14 | 689.43 | 2332.17 | 1000000 | 1000 | 1.00 |
Good | 6924.11 ± 1270.39 | 803.12 | 9132.25 | 1000000 | 1000 | 1.00 |