The first world


I implemented the first world. In this post, I define the rules of the first world, write them in the OpenAI Gym format, register the result as an OpenAI Gym environment, and use Keras-rl2 to train a neural network on it.

I put the code on GitHub, so here I'll focus on the specifications of the world (a minimal environment sketch follows the list below).

  • The world is a 30 x 30 2D grid space.
  • There is only one living entity (the agent).
  • The agent starts with 100 energy.
  • Food is distributed randomly (both its location and its amount (0-100) are random).
  • Time is discretized into steps.
  • In each step, the agent can either move in one of four directions or stay.
  • The agent spends 1 energy if it moves and 0.5 energy if it stays.
  • If the agent overlaps with food, it gains that food's energy and receives a reward of (food amount)/100.
  • If it runs out of energy, it dies and loses 3 reward points.
  • In each step, there is a 2% chance that new food appears in the world. Its amount is random (0-100).
  • An episode ends when the agent dies or after 2000 steps.
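To make these rules concrete, here is a rough sketch of what such an environment could look like in the Gym API. The class name, observation encoding, and initial amount of food are illustrative guesses, not the exact code in the repository.

import numpy as np
import gym
from gym import spaces

class SmallWorldEnv(gym.Env):
    """Sketch of the world rules above (names and details are guesses)."""

    GRID = 30
    MAX_STEPS = 2000

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(5)  # up, down, left, right, stay
        # Two channels: agent position and food amount, both scaled to [0, 1].
        self.observation_space = spaces.Box(
            0.0, 1.0, shape=(self.GRID, self.GRID, 2), dtype=np.float32)

    def reset(self):
        self.pos = np.random.randint(0, self.GRID, size=2)
        self.energy = 100.0
        self.food = np.zeros((self.GRID, self.GRID), dtype=np.float32)
        for _ in range(20):                      # initial food count is a guess
            self._drop_food()
        self.t = 0
        return self._obs()

    def step(self, action):
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
        dy, dx = moves[action]
        self.pos = np.clip(self.pos + np.array([dy, dx]), 0, self.GRID - 1)
        self.energy -= 1.0 if (dy, dx) != (0, 0) else 0.5  # moving costs more than staying

        reward = 0.0
        y, x = self.pos
        if self.food[y, x] > 0:                  # eat the food on the current cell
            reward += self.food[y, x] / 100.0
            self.energy += self.food[y, x]
            self.food[y, x] = 0.0

        if np.random.rand() < 0.02:              # 2% chance of new food per step
            self._drop_food()

        self.t += 1
        done = self.t >= self.MAX_STEPS
        if self.energy <= 0:                     # death: penalty and end of episode
            reward -= 3.0
            done = True
        return self._obs(), reward, done, {}

    def _drop_food(self):
        y, x = np.random.randint(0, self.GRID, size=2)
        self.food[y, x] = np.random.uniform(0, 100)

    def _obs(self):
        agent = np.zeros((self.GRID, self.GRID), dtype=np.float32)
        agent[self.pos[0], self.pos[1]] = 1.0
        return np.stack([agent, self.food / 100.0], axis=-1)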

Once the environment is registered with OpenAI Gym, running "pip install -e gym-smallworld" installs it in editable mode, which makes it easy to keep modifying the world specification during development.
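Registering a custom environment usually amounts to a small register call in the package's __init__.py. The package layout and environment ID below are assumptions for illustration, not necessarily the names used in the repository.

# gym_smallworld/__init__.py  (package name and env ID are assumed)
from gym.envs.registration import register

register(
    id='smallworld-v0',                               # afterwards: gym.make('smallworld-v0')
    entry_point='gym_smallworld.envs:SmallWorldEnv',  # the environment class sketched above
    max_episode_steps=2000,                           # matches the 2000-step episode limit
)

With the editable install, any change to the environment code takes effect the next time the package is imported, without reinstalling.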

The world is now defined. The network training is pretty much a simplified version of the Atari game example from Keras-rl2 (a training sketch follows the list). Specifically:

  • Two 3 x 3 convolutional layers with 16 filters each (stride 1)
  • One dense layer with 32 neurons
  • Scaled exponential linear unit (SELU) activation
  • Deep Q-Network (DQN) with an epsilon-greedy Q policy and linear annealing
  • Adam optimizer with learning rate 0.001
  • 10,000 warm-up steps
  • The target model is updated every 10,000 steps
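Putting those pieces together, a Keras-rl2 setup along these lines might look like the sketch below. The environment ID, annealing schedule, replay-memory size, discount factor, and total step count are placeholders rather than the exact values used.

import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Reshape, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('smallworld-v0')          # ID assumed from the registration sketch above
nb_actions = env.action_space.n
obs_shape = env.observation_space.shape  # (30, 30, 2) in the sketch above

# keras-rl prepends a window_length axis to each observation, hence the Reshape.
model = Sequential([
    Reshape(obs_shape, input_shape=(1,) + obs_shape),
    Conv2D(16, (3, 3), strides=1, activation='selu'),
    Conv2D(16, (3, 3), strides=1, activation='selu'),
    Flatten(),
    Dense(32, activation='selu'),
    Dense(nb_actions, activation='linear'),
])

# Epsilon-greedy exploration, linearly annealed (schedule values are guesses).
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps',
                              value_max=1.0, value_min=0.1,
                              value_test=0.05, nb_steps=1000000)
memory = SequentialMemory(limit=100000, window_length=1)

dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               policy=policy, nb_steps_warmup=10000,
               target_model_update=10000, gamma=0.99)
dqn.compile(Adam(learning_rate=0.001), metrics=['mae'])
dqn.fit(env, nb_steps=2000000, visualize=True, verbose=2)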

The differences from the example are the network size (smaller) and the use of SELU (which I have had good experience with). Most other things are the same. No GPU is used. If you install the appropriate packages (gym, Keras-rl2, etc., with Python 3), it should run fine on a CPU. Once it starts, training begins and an animation like this is displayed.

The green dot is the agent and the red dots are food. The brighter a dot, the more food there is at that location. If you hide the animation behind another window so that it does not render on your display, the rendering computation seems to be skipped and the run becomes noticeably faster. I don't know exactly how it works, but it's a nice tip.

Pretty much all the agent has to do is move towards any available food. There are no obstacles or enemies, so it is a fairly straightforward task. With this simple training scheme, the episode reward goes up nicely and saturates near 1.5-2M steps (blue is the episode reward, orange is the Q-value). Training takes about half a day on a Core i7-3770K.

One issue is that the current program does not show the same performance when switched to test mode. I'll investigate the reason for this (I suspect the policy change between training mode and test mode is the culprit).
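For reference, keras-rl agents follow the training policy during fit() but switch to a separate test policy (greedy by default) during test(), so one way to check whether that switch explains the gap is to make the two comparable. The epsilon value below is just a guess.

from rl.policy import EpsGreedyQPolicy

# dqn.test() uses dqn.test_policy (GreedyQPolicy by default) instead of the
# annealed epsilon-greedy policy used during training; forcing a comparable
# amount of exploration at test time isolates the effect of the policy switch.
dqn.test_policy = EpsGreedyQPolicy(eps=0.1)
dqn.test(env, nb_episodes=10, visualize=True)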

Anyway, the first world is complete. My next task is to make an environment in which multiple agents can be trained simultaneously and independently.

Author: Shinya

I'm a Scientist at the Allen Institute. I'm developing a biophysically realistic model of the primary visual cortex of the mouse. Formerly, I was a postdoc at the University of California, Santa Cruz. I received my Ph.D. in Physics from Indiana University, Bloomington. This blog is my personal activity and does not represent the opinions of the institution I belong to.