Implementing our first Multi-agent RL
Questions are more important than the answers.
What’s Multi-agent RL (sometimes abbreviated to MARL)?
In the normal RL setup, we have one agent communicating with the environment using observations, rewards, and actions. But in some problems, which often arise in reality, several agents are involved in the environment interaction; for example (a sketch of such a loop follows this list):
- Multiplayer games, like Dota 2 or StarCraft II, where the agent needs to control several units competing against other players’ units.
- Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns, and pushing ahead on unstructured urban roadways.
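Conceptually, the interaction loop stays the same as in single-agent RL, but observations, actions, and rewards become per-agent collections. The following is a minimal, self-contained sketch of that idea; the agent names and the reset() and step() functions here are made up for illustration and are not the API of any particular library:
import random

# Toy multi-agent loop: at every step, each agent receives its own observation
# and reward and contributes its own action. Purely illustrative.
AGENTS = ["tiger_1", "tiger_2", "deer_1"]

def reset():
    # hypothetical initial observation per agent
    return {agent: [0.0, 0.0] for agent in AGENTS}

def step(actions):
    # hypothetical transition: random observations and rewards per agent
    obs = {agent: [random.random(), random.random()] for agent in AGENTS}
    rewards = {agent: random.choice([0.0, 1.0]) for agent in AGENTS}
    done = random.random() < 0.05
    return obs, rewards, done

observations = reset()
done = False
while not done:
    # every agent selects its own (here random) action from its own observation
    actions = {agent: random.randint(0, 3) for agent in observations}
    observations, rewards, done = step(actions)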
The MAgent environment
The high-level concept of MAgent is simple and efficient: it provides a simulation of a grid world inhabited by 2D agents.
For example, the first environment that we will consider is a predator-prey model, where “tigers” hunt “deer” and obtain a reward for doing so.
A random environment
To start, let’s visualize a random environment in which all agents act randomly.
Let’s implement the code step by step.
import os
import sys
sys.path.append(os.path.join(os.getcwd(), "MAgent/python"))
As MAgent is not installed as a package, we extend the Python path so that its modules can be imported from the local MAgent directory.
import magent
from magent.builtin.rule_model import RandomActor

MAP_SIZE = 64
We import the main package provided by MAgent. In addition, we define the size of our environment, which is a 64x64 grid.
if __name__ == "__main__":
    env = magent.GridWorld("forest", map_size=MAP_SIZE)
    env.set_render_dir("render")
First of all, we create the environment, which is represented by the GridWorld class.
    deer_handle, tiger_handle = env.get_handles()
    models = [
        RandomActor(env, deer_handle),
        RandomActor(env, tiger_handle),
    ]
We obtain the handles of the two agent groups (deer and tigers) and create a random-action model for each of them.
    env.reset()
    env.add_walls(method="random", n=MAP_SIZE * MAP_SIZE * 0.04)
    env.add_agents(deer_handle, method="random", n=5)
    env.add_agents(tiger_handle, method="random", n=2)
In MAgent terminology, reset() clears the grid completely, which is different from Gym. The preceding code turns 4% of the grid cells into walls with add_walls(), and randomly places five deer and two tigers.
    v = env.get_view_space(tiger_handle)
    r = env.get_feature_space(tiger_handle)
    print("Tiger view: %s, features: %s" % (v, r))
    vv = env.get_view_space(deer_handle)
    rr = env.get_feature_space(deer_handle)
    print("Deer view: %s, features: %s" % (vv, rr))
In MAgent, the observation of every agent is divided into two parts: the view space, a spatial tensor describing the agent’s surroundings, and the feature space, a flat vector with additional information about the agent.
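If we later want to feed both parts into a single network, one simple option is to flatten the spatial view per agent and concatenate it with the feature vector. Below is a minimal NumPy sketch; the shapes are made up for illustration and do not correspond to the actual "forest" configuration:
import numpy as np

# Hypothetical shapes: 5 agents, a 3x3 view with 5 planes, and 16 extra features.
view = np.zeros((5, 3, 3, 5), dtype=np.float32)    # view space part
features = np.zeros((5, 16), dtype=np.float32)     # feature space part

# Flatten the spatial view per agent and concatenate it with the feature
# vector, producing one flat observation vector per agent.
flat_obs = np.concatenate([view.reshape(view.shape[0], -1), features], axis=1)
print(flat_obs.shape)   # (5, 61)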
    done = False
    step_idx = 0
    while not done:
        deer_obs = env.get_observation(deer_handle)
        tiger_obs = env.get_observation(tiger_handle)
        if step_idx == 0:
            print("Tiger obs: %s, %s" % (
                tiger_obs[0].shape, tiger_obs[1].shape))
            print("Deer obs: %s, %s" % (
                deer_obs[0].shape, deer_obs[1].shape))
        print("%d: HP deers: %s" % (
            step_idx, deer_obs[0][:, 1, 1, 2]))
        print("%d: HP tigers: %s" % (
            step_idx, tiger_obs[0][:, 4, 4, 2]))
We start the step loop, where we get the observations of both groups, print their shapes on the first step, and show the health points of every agent.
        deer_act = models[0].infer_action(deer_obs)
        tiger_act = models[1].infer_action(tiger_obs)
        env.set_action(deer_handle, deer_act)
        env.set_action(tiger_handle, tiger_act)
Then we ask the models to select actions from the observations (the actions are chosen randomly) and pass those actions to the environment.
        env.render()
        done = env.step()
        env.clear_dead()
        t_reward = env.get_reward(tiger_handle)
        d_reward = env.get_reward(deer_handle)
        print("Rewards: deer %s, tiger %s" % (d_reward, t_reward))
        step_idx += 1
First of all, we ask the environment to save information about the agents and their locations so that it can be viewed later. Then we call env.step() to perform one time step in the grid-world simulation. This function returns a single Boolean flag, which becomes True once all the agents are dead. After that, we remove the dead agents with clear_dead(), get the reward vector for each group, show it, and iterate the loop again.
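If we wanted to track how each group performs over the whole episode, we could accumulate the reward vectors returned at every step. The following is a small illustrative sketch; the helper function and the example reward vectors are made up, but they have the same structure as the vectors returned by get_reward() above:
import numpy as np

# Illustrative bookkeeping: accumulate each group's total reward per episode.
# The reward vectors contain one entry per living agent, so their length can
# shrink over time as agents die; summing them per step still works.
episode_rewards = {"deer": 0.0, "tiger": 0.0}

def accumulate(totals, d_reward, t_reward):
    totals["deer"] += float(np.sum(d_reward))
    totals["tiger"] += float(np.sum(t_reward))

# Example with made-up reward vectors from two consecutive steps:
accumulate(episode_rewards, d_reward=[0.0, 0.0, 1.0], t_reward=[0.5, 0.0])
accumulate(episode_rewards, d_reward=[0.0, 1.0], t_reward=[1.0, 0.0])
print(episode_rewards)   # {'deer': 2.0, 'tiger': 1.5}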