Dataset Cards - OG MARL

3m

3m - Download
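
Each dataset is distributed as a downloadable archive. The snippet below is a generic sketch of fetching and extracting such an archive with only the Python standard library; the URL and local paths are placeholders rather than the real download location, and the OG-MARL codebase also provides its own helpers for fetching datasets.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Placeholder URL: substitute the actual download link for the 3m dataset.
DATASET_URL = "https://example.com/og_marl/smac_v1/3m.zip"

target_dir = Path("datasets/smac_v1/3m")
target_dir.mkdir(parents=True, exist_ok=True)

# Download the archive and unpack it next to where it was saved.
archive_path, _ = urlretrieve(DATASET_URL, target_dir / "3m.zip")
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall(target_dir)
```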

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v1) | SMAC V1, from OxWhiRL | 3 | Discrete | [30] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, using an epsilon-greedy policy with eps=0.05. This procedure was repeated 4 times and the resulting data was combined.
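
For reference, the behaviour policy described above is plain epsilon-greedy action selection over each agent's Q-values. The sketch below is illustrative only: the function name and the legal-action mask argument (SMAC exposes per-agent legal actions) are assumptions, not the exact OG-MARL implementation.

```python
import numpy as np

def epsilon_greedy_action(q_values, legal_mask, epsilon=0.05, rng=None):
    """Select one agent's action epsilon-greedily over its legal actions.

    q_values:   (num_actions,) Q-values for the agent's current observation.
    legal_mask: (num_actions,) boolean mask, True where the action is legal.
    epsilon:    probability of a uniformly random legal action (0.05 here).
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: sample uniformly from the legal actions.
        return int(rng.choice(np.flatnonzero(legal_mask)))
    # Exploit: act greedily, with illegal actions masked out.
    masked_q = np.where(legal_mask, q_values, -np.inf)
    return int(np.argmax(masked_q))
```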

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Poor | 4.69 ± 2.14 | 0.00 | 20.00 | 997370 | 48779 | 0.81 |
| Medium | 9.96 ± 6.06 | 0.00 | 20.00 | 995313 | 41619 | 0.85 |
| Good | 16.49 ± 5.92 | 0.00 | 20.00 | 996366 | 43559 | 0.80 |
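
The return and size columns above can be recomputed directly from per-episode returns and lengths gathered while iterating over a dataset. The sketch below is a rough illustration under that assumption (`episode_returns` and `episode_lengths` are hypothetical inputs); Joint SACo, OG-MARL's joint state-action coverage measure, is reported in the tables but not reproduced here.

```python
import numpy as np

def summarise_dataset(episode_returns, episode_lengths):
    """Recompute the summary columns from per-episode returns and lengths."""
    returns = np.asarray(episode_returns, dtype=float)
    lengths = np.asarray(episode_lengths, dtype=int)
    return {
        "episode_return_mean": f"{returns.mean():.2f} ± {returns.std():.2f}",
        "min_return": round(float(returns.min()), 2),
        "max_return": round(float(returns.max()), 2),
        "transitions": int(lengths.sum()),
        "trajectories": int(len(returns)),
    }
```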

8m

8m - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v1) | SMAC V1, from OxWhiRL | 8 | Discrete | [80] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, using an epsilon-greedy policy with eps=0.05. This procedure was repeated 4 times and the resulting data was combined.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Poor | 5.28 ± 0.56 | 0.00 | 7.62 | 995144 | 20629 | 0.64 |
| Medium | 10.14 ± 3.34 | 0.00 | 20.00 | 996501 | 39208 | 0.96 |
| Good | 16.86 ± 4.33 | 0.19 | 20.00 | 997785 | 30638 | 0.86 |

5m_vs_6m

5m_vs_6m - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v1) | SMAC V1, from OxWhiRL | 5 | Discrete | [55] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, using an epsilon-greedy policy with eps=0.05. This procedure was repeated 4 times and the resulting data was combined.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Poor | 7.45 ± 1.48 | 0.00 | 20.00 | 934505 | 45501 | 0.85 |
| Medium | 12.62 ± 5.06 | 0.00 | 20.00 | 996856 | 39284 | 0.87 |
| Good | 16.58 ± 4.69 | 0.00 | 20.00 | 996727 | 36311 | 0.84 |

2s3z

2s3z - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v1) | SMAC V1, from OxWhiRL | 5 | Discrete | [80] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, using an epsilon-greedy policy with eps=0.05. This procedure was repeated 4 times and the resulting data was combined.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Poor | 6.88 ± 2.06 | 0.00 | 13.61 | 996418 | 9942 | 0.96 |
| Medium | 12.57 ± 3.14 | 0.00 | 21.30 | 996256 | 18605 | 0.98 |
| Good | 18.32 ± 2.95 | 0.00 | 21.62 | 995829 | 18616 | 0.98 |

3s5z_vs_3s6z

3s5z_vs_3s6z - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v1) | SMAC V1, from OxWhiRL | 8 | Discrete | [136] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, using an epsilon-greedy policy with eps=0.05. This procedure was repeated 4 times and the resulting data was combined.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Poor | 5.90 ± 2.22 | 0.19 | 11.93 | 996474 | 17807 | 0.96 |
| Medium | 10.69 ± 1.49 | 0.00 | 17.67 | 996699 | 18866 | 0.97 |
| Good | 16.56 ± 3.72 | 6.30 | 24.46 | 996528 | 7315 | 0.97 |

terran_5_vs_5

terran_5_vs_5 - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v2) | SMAC V2, from OxWhiRL | 5 | Discrete | [82] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, using an epsilon-greedy policy with eps=0.05. This procedure was repeated 4 times and the resulting data was combined.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Replay | 10.05 ± 5.84 | 0.00 | 36.34 | 898164 | 17958 | 1.00 |
| Random | 2.43 ± 1.73 | 0.00 | 16.18 | 1500000 | 37874 | 0.91 |

terran_10_vs_10

terran_10_vs_10 - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v2) | SMAC V2, from OxWhiRL | 10 | Discrete | [162] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 1M transitions, using an epsilon-greedy policy with eps=0.05.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Replay | 6.32 ± 3.62 | 0.00 | 23.01 | 749850 | 13588 | 1.00 |

zerg_5_vs_5

zerg_5_vs_5 - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| SMAC (v2) | SMAC V2, from OxWhiRL | 5 | Discrete | [82] | Dense |

Generation procedure for each dataset

A QMIX system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 1M transitions, using an epsilon-greedy policy with eps=0.05.

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Replay | 7.34 ± 3.60 | 0.00 | 24.00 | 863281 | 23294 | 1.00 |

2halfcheetah

2halfcheetah - Download

Metadata

| Environment name | Version | Agents | Action type | Observation size | Reward type |
| --- | --- | --- | --- | --- | --- |
| MAMuJoCo | V1.1, Mujoco v210 | 2 | Continuous | [13] | Dense |

Generation procedure for each dataset

A MATD3 system was trained to a target level of performance. The learnt policy was then rolled out to collect approximately 250k transitions, with Gaussian noise (standard deviation 0.2) added to the selected actions. This procedure was repeated 4 times and the resulting data was combined.
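
For these continuous-action MAMuJoCo datasets the exploration mechanism is Gaussian noise on the policy's actions rather than epsilon-greedy. A minimal sketch is given below; clipping the noisy action to [-1, 1] is an assumption about the action bounds, not something stated above.

```python
import numpy as np

def noisy_action(policy_action, noise_std=0.2, low=-1.0, high=1.0, rng=None):
    """Add Gaussian exploration noise to a deterministic MATD3 action.

    policy_action: (action_dim,) action produced by the trained policy.
    noise_std:     standard deviation of the noise (0.2 for these datasets).
    low, high:     assumed action bounds used to clip the noisy action.
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, noise_std, size=np.shape(policy_action))
    return np.clip(policy_action + noise, low, high)
```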

Summary statistics

| Uid | Episode return mean | Min return | Max return | Transitions | Trajectories | Joint SACo |
| --- | --- | --- | --- | --- | --- | --- |
| Poor | 400.45 ± 333.96 | -191.49 | 905.03 | 1000000 | 1000 | 1.00 |
| Medium | 1485.00 ± 469.14 | 689.43 | 2332.17 | 1000000 | 1000 | 1.00 |
| Good | 6924.11 ± 1270.39 | 803.12 | 9132.25 | 1000000 | 1000 | 1.00 |