<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>大トロ</title>
    <atom:link href="https://blog.otoro.net/feed.xml" rel="self" type="application/rss+xml"/>
    <link>https://blog.otoro.net/</link>
    <description>ml ・ design</description>
    <pubDate>Mon, 03 Oct 2022 22:32:12 -0500</pubDate>
    
      <item>
        <title>Collective Intelligence for Deep Learning: A Survey of Recent Developments</title>
        <link>https://blog.otoro.net/2022/10/01/collectiveintelligence/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2022/10/01/collectiveintelligence/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/magent_large.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;br /&gt;
&lt;i&gt;We survey ideas from complex systems such as swarm intelligence, self-organization, and emergent behavior that are gaining traction in ML. (Figure: Emergence of encirclement tactics in &lt;a href=&quot;https://arxiv.org/abs/1712.00600&quot;&gt;MAgent&lt;/a&gt;.)&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Unless you’ve been living under a rock, you’ve noticed that artificial neural networks are now used everywhere. They’re impacting our everyday lives, from predictive tasks such as recommendation, facial recognition and object classification, to generative tasks such as machine translation and image, sound, and video generation. But behind all of these advances, the impressive feats of deep learning required a substantial amount of sophisticated engineering effort.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/alexnet.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;AlexNet&lt;/strong&gt;. Neural network architecture of &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3065386&quot;&gt;AlexNet&lt;/a&gt; (Krizhevsky et al. 2012), the winner of the ImageNet competition in 2012.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Even if we look at the early AlexNet from 2012, which made deep learning famous when it won that year’s ImageNet competition, we can see the careful engineering decisions involved in its design. Modern networks are often even more sophisticated, requiring a pipeline that spans network architecture and careful training schemes. A lot of sweat and labor went into producing these amazing results.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/bridge_vs_ant.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Engineered vs Emerged Bridges&lt;/strong&gt;. Left: The Confederation Bridge in Canada. Right: Army ants forming a bridge.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I believe the way we are currently doing deep learning is like engineering: we build neural network systems the same way we build bridges and buildings. But in natural systems, where the concept of &lt;em&gt;emergence&lt;/em&gt; plays a big role, we see complex designs that emerge through self-organization, and such designs are usually sensitive and responsive to changes in the world around them. Natural systems &lt;em&gt;adapt&lt;/em&gt;, and become &lt;em&gt;a part&lt;/em&gt; of their environment.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;i&gt;“Bridges and buildings are all designed to be indifferent to their environment, to withstand fluctuations, not to adapt to them. The best bridge is one that just stands there, whatever the weather.”&lt;/i&gt;&lt;/b&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
 &amp;mdash; Andrew Pickering, &lt;a href=&quot;https://www.goodreads.com/book/show/7636063-the-cybernetic-brain&quot;&gt;The Cybernetic Brain&lt;/a&gt;.
&lt;p&gt;
&lt;/p&gt;
&lt;/center&gt;
&lt;hr /&gt;

&lt;p&gt;In the last few years, I have noticed many works popping up in deep learning research that use ideas from collective intelligence, in particular from the area of emergent complex systems. Recently, &lt;a href=&quot;https://twitter.com/yujin_tang&quot;&gt;Yujin Tang&lt;/a&gt; and I put together a survey paper on this topic called &lt;a href=&quot;https://doi.org/10.1177/26339137221114874&quot;&gt;Collective intelligence for deep learning: A survey of recent developments&lt;/a&gt;, and in this post, I will summarize the key themes in our paper.&lt;/p&gt;

&lt;h2 id=&quot;historical-background&quot;&gt;Historical Background&lt;/h2&gt;

&lt;p&gt;The course deep learning took may be just an accident of history; it didn’t have to be this way. In fact, in the earlier days of neural network development, in the 1980s, many groups, including one led by the legendary electrical engineer &lt;a href=&quot;https://en.wikipedia.org/wiki/Leon_O._Chua&quot;&gt;Leon Chua&lt;/a&gt;, worked on neural networks much closer to natural adaptive systems. They developed &lt;a href=&quot;https://en.wikipedia.org/wiki/Cellular_neural_network&quot;&gt;Cellular Neural Networks&lt;/a&gt;: artificial neural network circuits built from grids of artificial neurons.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/cenn01.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/cenn02.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Cellular Neural Networks&lt;/strong&gt;. Each neuron in a cellular neural network receives signals from its neighbors, performs a weighted sum, applies a non-linear activation function, much as we do today, and sends the result off to its neighbors. The difference between these networks and today’s networks is that they were built using analog circuits: they computed only approximately, but at the time ran much faster than digital circuits. Also, the wiring of each cell (the ‘parameters’ of each cell) is exactly the same. (Source: &lt;a href=&quot;https://youtu.be/TZrXncVE9e8&quot;&gt;The Chua Lectures&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What is remarkable is that even in the late 1980s, these networks were shown to produce amazing results such as object extraction. These analog networks work in &lt;em&gt;nano-seconds&lt;/em&gt;, speeds we were only able to match decades later with digital circuits. They can be programmed to do non-trivial things, like selecting all objects in a pixel image that are &lt;em&gt;pointing up&lt;/em&gt; and erasing all the others, tasks we only managed to reproduce decades later with deep learning:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;img src=&quot;/assets/20221001/cenn_object_detection.jpg&quot; width=&quot;70%&quot; /&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Using Cellular Neural Networks to detect all objects that are pointing upwards&lt;/strong&gt;. Left: Input pixel image. Right: Output pixel image.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the past few years, we have noticed many works in deep learning research exploring ideas similar to these cellular neural networks, drawn from emergent complex systems, which prompted us to write a survey. The problem is that complex systems is a huge field, spanning topics as far afield as the behavior of actual honeybee swarms and ant colonies, so we will limit our discussion to a few areas focused on machine learning:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Image Processing and Generative Models&lt;/li&gt;
  &lt;li&gt;Deep Reinforcement Learning&lt;/li&gt;
  &lt;li&gt;Multi-agent Learning&lt;/li&gt;
  &lt;li&gt;Meta-learning (&lt;em&gt;“Learning-to-learn”&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;image-generation&quot;&gt;Image Generation&lt;/h2&gt;

&lt;p&gt;We’ll start by discussing the idea of &lt;em&gt;image generation&lt;/em&gt; using collective intelligence. One cool example of this is a collective &lt;em&gt;human&lt;/em&gt; intelligence: the Reddit &lt;a href=&quot;https://old.reddit.com/r/place/&quot;&gt;r/Place&lt;/a&gt; experiment. In this community experiment, Reddit set up a 1000x1000 pixel canvas for Reddit users to collectively create a megapixel image. But the interesting thing is the constraint Reddit imposed: each user is only allowed to paint a &lt;em&gt;single&lt;/em&gt; pixel every 5 minutes:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/reddit_rplace.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Reddit &lt;a href=&quot;https://old.reddit.com/r/place/&quot;&gt;r/Place&lt;/a&gt; experiment&lt;/strong&gt;: Watch a few days of activity happen in minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This experiment lasted for a week, allowing millions of Reddit users to draw whatever they wanted. Because of the time constraint imposed on each user, drawing something meaningful required collaboration, and users ultimately coordinated strategies on discussion forums to &lt;em&gt;defend&lt;/em&gt; their designs, &lt;em&gt;attack&lt;/em&gt; other designs, and even form alliances. It is truly an example of the creativity of collective human intelligence.&lt;/p&gt;

&lt;p&gt;Early algorithms also computed designs on a pixel grid in a collective way. An example is the &lt;em&gt;Cellular Automaton&lt;/em&gt;, exemplified by Conway’s Game of Life, where the state of each pixel on a grid is computed by a function of the states of its neighbors at the previous time step. From these simple rules, complex patterns can emerge:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;img src=&quot;/assets/20221001/conway.gif&quot; width=&quot;50%&quot; /&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Conway’s Game of Life&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;
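&lt;p&gt;The entire rule fits in a few lines. Here is a minimal sketch in Python using NumPy (with wrap-around borders for simplicity), demonstrated on the classic glider pattern:&lt;/p&gt;

```python
import numpy as np

def life_step(grid):
    """One step of Conway's Game of Life on a 2D binary array."""
    # Count live neighbors by summing the 8 shifted copies of the grid
    # (np.roll gives toroidal wrap-around borders).
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbors,
    # or if it is alive now and has exactly 2 live neighbors.
    return np.logical_or(
        neighbors == 3,
        np.logical_and(grid == 1, neighbors == 2),
    ).astype(int)

# The classic "glider": after 4 steps it reappears shifted one cell
# diagonally, a pattern that emerges purely from the local rule.
glider = np.zeros((8, 8), dtype=int)
for r, c in [(1, 2), (2, 3), (3, 1), (3, 2), (3, 3)]:
    glider[r, c] = 1
g = glider
for _ in range(4):
    g = life_step(g)
```

&lt;p&gt;After four steps, the five live cells form the original glider translated one cell down and to the right.&lt;/p&gt;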

&lt;p&gt;A recent work, Neural Cellular Automata (&lt;a href=&quot;https://distill.pub/2020/growing-ca/&quot;&gt;Mordvintsev et al., 2020&lt;/a&gt;), extends the concept of CAs by replacing the simple rules with a neural network, so in a sense it is really similar to the Cellular Neural Networks from the 1980s discussed earlier. In this work, they apply Neural CAs to image generation: as in the Reddit r/Place example, at each time step a randomly chosen subset of pixels is updated, based on the output of a single neural network function whose inputs are only the values of each pixel’s immediate neighbors.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;img src=&quot;/assets/20221001/neuralca.jpg&quot; width=&quot;100%&quot; /&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Neural Cellular Automata&lt;/strong&gt; for Image Generation.&lt;/em&gt;&lt;/p&gt;
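&lt;p&gt;The per-cell update can be sketched in a few lines. Below is a minimal, untrained toy version (the single linear layer, scalar cell state, and update probability are illustrative stand-ins; the actual model uses a small deep network and multi-channel cell states):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One rule network shared by every cell: 3x3 neighborhood in, state delta out.
# A single linear layer here, purely for illustration.
W = rng.normal(0, 0.1, size=(9, 1))

def nca_step(grid, update_prob=0.5):
    """One stochastic Neural CA step over a 2D grid of scalar cell states."""
    h, w = grid.shape
    padded = np.pad(grid, 1)          # zero states just outside the border
    new_grid = grid.copy()
    for i in range(h):
        for j in range(w):
            # Each cell sees only its immediate 3x3 neighborhood...
            patch = padded[i:i + 3, j:j + 3].reshape(9)
            # ...and only a random subset of cells fires on any given step.
            if update_prob > rng.random():
                new_grid[i, j] = grid[i, j] + np.tanh(patch @ W)[0]
    return new_grid

grid = np.zeros((8, 8))
grid[4, 4] = 1.0                      # a single "seed" cell
for _ in range(10):
    grid = nca_step(grid)
```

&lt;p&gt;Starting from a single seed cell, activity spreads outward through purely local interactions; training such a rule (with gradients through the unrolled steps) is what makes a target image emerge.&lt;/p&gt;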

&lt;p&gt;They show that a Neural CA can be trained to output any particular given design from a sparse stochastic sampling rule and an almost empty initial canvas. Here are three Neural CAs producing three designs. What is remarkable about this method is that when part of the image gets corrupted, the algorithm automatically regenerates the corrupted part in its own way.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/neural_ca.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Neural Cellular Automata&lt;/strong&gt; regenerating corrupted images.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Neural CAs can also perform prediction tasks in a collective fashion. For example, they can be applied to classify &lt;a href=&quot;https://distill.pub/2020/selforg/mnist/&quot;&gt;MNIST digits&lt;/a&gt;, but here each cell must produce its own prediction based on its own pixel and the predictions of its immediate neighbors. Its prediction in turn influences its neighbors’ predictions and changes their opinions over time, like in a democratic society. Usually some consensus forms across the collection of pixels, but sometimes we see interesting effects: if the digit is written in an ambiguous way, different regions of the digit can settle into different steady-state predictions.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/neural_ca_mnist.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&quot;https://distill.pub/2020/selforg/mnist/&quot;&gt;Self-classifying MNIST Digits&lt;/a&gt;&lt;/strong&gt;. A Neural Cellular Automata trained to recognize MNIST digits, created by &lt;a href=&quot;https://distill.pub/2020/selforg/mnist/&quot;&gt;Randazzo et al. 2020&lt;/a&gt;, is also available as an interactive web demo. Each cell is only allowed to see the contents of a single pixel and communicate with its neighbors. Over time, a consensus will form as to which digit the cells most likely belong to, but interestingly, disagreements may arise depending on the location of the pixel where the prediction is made.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Neural CAs are not confined to generating pixels; they can also generate voxels and 3D shapes. Recent work even used Neural CAs to produce designs in Minecraft, whose building blocks are essentially voxels. They can produce things like buildings and trees, but most interestingly, since some components inside Minecraft are active rather than passive, they can also generate functional machines with behavior.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/neuralca_minecraft.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Neural CAs have also been applied to the regeneration of Minecraft entities. In this work by &lt;a href=&quot;https://arxiv.org/abs/2103.08737&quot;&gt;Sudhakaran et al., 2021&lt;/a&gt;, the authors’ formulation enabled the regeneration not only of Minecraft buildings and trees, but also of simple functional machines in the game, such as worm-like creatures that can even regenerate into two distinct creatures when cut in half.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here, they show that when one of these functional machines gets cut in half, each half regenerates itself morphogenetically, ending up as two functional machines.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/morphogenesis.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Morphogenesis&lt;/strong&gt;. Beyond regenerating static structures, the Neural CA system in Minecraft is able to regrow parts of simple functional machines, such as a virtual creature in the game. Here, a morphogenetic creature grows into two distinct creatures when cut in half.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;deep-reinforcement-learning&quot;&gt;Deep Reinforcement Learning&lt;/h2&gt;

&lt;p&gt;Another popular area within Deep Learning is to train neural networks with reinforcement learning for tasks like locomotion control. Here are a few examples of these &lt;em&gt;Mujoco Humanoid&lt;/em&gt; benchmark environments and their state-of-the-art solutions:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/humanoid_sota.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;State-of-the-art Mujoco Humanoids&lt;/strong&gt;. You may not like it, but this is what peak performance looks like.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What usually happens is that all of the input observations (in the case of the humanoid, 376 of them) are fed into a deep neural network, the “policy”, which outputs the 17 actions required to control the actuators of the humanoid so it moves forward. Typically, these policy networks tend to overfit the training environment, so you end up with solutions that only work for this exact body design and simulation environment.&lt;/p&gt;
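&lt;p&gt;As a point of reference, the standard monolithic setup looks something like this. The 376/17 sizes match the Mujoco Humanoid; the single hidden layer and random weights are an illustrative simplification of a trained policy:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One big policy network: every observation in, every action out.
W1 = rng.normal(0, 0.1, size=(376, 64))   # 376 observations to a hidden layer
W2 = rng.normal(0, 0.1, size=(64, 17))    # hidden layer to 17 actuator torques

def policy(obs):
    """Map the full observation vector to all 17 actions at once."""
    return np.tanh(np.tanh(obs @ W1) @ W2)

action = policy(rng.normal(size=376))
```

&lt;p&gt;Note how the weight shapes hard-code one particular body: change the number of sensors or actuators, and this policy is useless.&lt;/p&gt;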

&lt;p&gt;We’ve seen some interesting works recently that look at using a collective controller approach for these problems. In particular, in &lt;a href=&quot;https://arxiv.org/abs/2007.04976&quot;&gt;Huang et al., 2020&lt;/a&gt;, rather than having one policy network take all of the inputs and output all of the actions, here, they use a single shared policy for every actuator in the agent, effectively decomposing an agent into a collection of agents connected by limbs:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/rl_module.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Traditional RL methods train a specific policy for a particular robot with a fixed morphology. But recent work, like the one shown here by &lt;a href=&quot;https://arxiv.org/abs/2007.04976&quot;&gt;Huang et al. 2020&lt;/a&gt;, attempts to train a single modular neural network responsible for controlling a single part of a robot. The global policy of each robot thus results from the coordination of these identical modular neural networks, emerging from local interaction. This system can generalize across a variety of different skeletal structures, from hoppers to quadrupeds, and even to some unseen morphologies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These policies communicate bi-directionally with their neighbors, so over time a global policy can emerge from local interaction. Not only is this single policy trained on one agent design, it must work across dozens of designs in the training set, so every one of these agents is controlled by the same policy governing each actuator:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/rl_module.gif&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;One identical neural network&lt;/strong&gt; controlling every actuator must work across all of these designs.&lt;/em&gt;&lt;/p&gt;
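&lt;p&gt;A toy sketch of the idea follows. The sizes, the single forward messaging pass, and the random weights are all hypothetical simplifications; the actual method learns the weights with RL and runs several bi-directional message-passing rounds:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

OBS, MSG, HID = 4, 3, 16   # illustrative sizes, not the paper's

# One set of weights shared by EVERY actuator module.
W_in = rng.normal(0, 0.1, size=(OBS + 2 * MSG, HID))
W_act = rng.normal(0, 0.1, size=(HID, 1))    # one torque per joint
W_msg = rng.normal(0, 0.1, size=(HID, MSG))  # message sent to neighbors

def actuator_module(local_obs, msg_prev, msg_next):
    """The same function runs at every joint; only its inputs differ."""
    h = np.tanh(np.concatenate([local_obs, msg_prev, msg_next]) @ W_in)
    return np.tanh(h @ W_act)[0], np.tanh(h @ W_msg)

def collective_policy(observations):
    """Control a chain of joints of ANY length with one shared module."""
    n = len(observations)
    msgs = [np.zeros(MSG) for _ in range(n + 2)]  # zero messages at the ends
    actions = []
    for i, obs in enumerate(observations):
        act, msgs[i + 1] = actuator_module(obs, msgs[i], msgs[i + 2])
        actions.append(act)
    return np.array(actions)

# The very same weights control a 4-joint hopper or a 7-joint walker.
a4 = collective_policy([rng.normal(size=OBS) for _ in range(4)])
a7 = collective_policy([rng.normal(size=OBS) for _ in range(7)])
```

&lt;p&gt;Because nothing in the shared module depends on the total number of joints, the same weights produce an action vector of whatever length the morphology requires.&lt;/p&gt;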

&lt;p&gt;They show that this type of collective system has some zero-shot generalization capabilities and can also control agents with not only different design variations with different limb lengths and masses, but also novel designs not in the training set, and also deal with unseen challenges:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/rl_module.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Well, why rely on a fixed design? Another work, &lt;a href=&quot;https://arxiv.org/abs/1902.05546&quot;&gt;Pathak et al., 2019&lt;/a&gt;, looks at getting every limb to figure out a way to self-assemble and &lt;em&gt;learn&lt;/em&gt; a design to perform tasks like balancing and locomotion:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;
&lt;img src=&quot;/assets/20221001/self_assembling_limbs.jpg&quot; width=&quot;70%&quot; /&gt;
&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Self-assembling limbs&lt;/strong&gt;. Self-organization also enables systems in RL environments to self-configure their own designs for a given task. In &lt;a href=&quot;https://arxiv.org/abs/1902.05546&quot;&gt;Pathak et al., 2019&lt;/a&gt;, the authors explored such dynamic and modular agents and showed that they can generalize not only to unseen environments, but also to unseen morphologies composed of additional modules.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They show that this approach can generalize to cases with double or half the number of limbs the system was trained on, something simply not possible with traditional deep RL. And even in settings where a system trained with traditional deep RL does work, the self-assembling solutions consistently prove more robust to unseen challenges like wind or, in the case of locomotion, to new types of terrain such as hurdles and stairs:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/self_assembling_limbs.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This type of collective policy making can also be applied to image-based RL tasks. In a recent &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;paper&lt;/a&gt; that Yujin Tang and I presented at NeurIPS, we looked at feeding each patch of a video feed into identical sensory neuron units. These sensory neurons must figure out the context of their own input channels and then self-organize, using an attention mechanism for communication, to collectively output motor commands for the agent. This allows the agent to keep working even when the patches on the screen are all shuffled:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/car_racing.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/mt_fuji.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Sensory Substitution&lt;/strong&gt;. Using the properties of self-organization and attention, our paper, &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;Tang and Ha, 2021&lt;/a&gt;, investigated RL agents that treat their observations as an arbitrarily ordered, variable-length list of sensory inputs. The input in visual tasks such as CarRacing is partitioned into a 2D grid of small patches whose ordering is shuffled. Each sensory neuron in the system receives a stream from a particular patch of pixels and, through coordination, must complete the task at hand. The agent even works with new backgrounds it hasn’t seen during training (it has only seen the green grass background).&lt;/em&gt;&lt;/p&gt;
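&lt;p&gt;The key mechanism is attention over an unordered set: every patch passes through the same sensory-neuron weights, and a fixed query pools them, so the output cannot depend on patch order. A minimal sketch with toy sizes and random weights (all hypothetical, not the paper’s architecture):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH, KEY, OUT = 16, 8, 3   # toy sizes, not the paper's

W_k = rng.normal(0, 0.1, size=(PATCH, KEY))  # shared sensory-neuron weights
W_v = rng.normal(0, 0.1, size=(PATCH, OUT))
Q = rng.normal(0, 0.1, size=(1, KEY))        # a fixed query vector

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(patches):
    """Pool an arbitrarily ordered set of patches into a fixed-size output."""
    K = patches @ W_k                        # every patch through the SAME neuron
    V = patches @ W_v
    A = softmax(Q @ K.T / np.sqrt(KEY))      # attention weights over the set
    return (A @ V).ravel()                   # a weighted sum: order cannot matter

patches = rng.normal(size=(10, PATCH))
out_original = attend(patches)
out_shuffled = attend(patches[rng.permutation(10)])
```

&lt;p&gt;Shuffling the patches permutes the rows of the keys and values identically, so the attention-weighted sum, and hence the output, is unchanged.&lt;/p&gt;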

&lt;p&gt;The work is inspired by the idea of sensory substitution, where different parts of the brain can be retrained to process different sensory modalities, enabling us to adapt our senses to crucial information sources.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/sensory_substitution.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Neuroscientist Paul Bach-y-Rita (1934-2006)&lt;/strong&gt; is known as “the father of sensory substitution”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This method works on non-vision tasks too. When we apply this method to a locomotion task, like this ant agent, we can shuffle the ordering of the 28 inputs quite frequently, and our agent will quickly adjust to a dynamic observation space:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/attention_agent_ants.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Permutation invariant reinforcement learning agents adapting to sensory substitutions.&lt;/strong&gt; The ordering of the ant’s 28 observations are randomly shuffled every 200 time-steps. Unlike the standard policy, our policy is not affected by the suddenly permuted inputs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can get the agent to play a Puzzle Pong game where the patches are constantly reshuffled, and we show that the system can also work with partial information, like with only 70% of the patches, which are all shuffled:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/pong_reshuffle.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/pong_occluded_reshuffle.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;h2 id=&quot;multi-agent-learning&quot;&gt;Multi-agent learning&lt;/h2&gt;

&lt;p&gt;The earlier reinforcement learning examples were mainly about decomposing a single agent into a smaller collection of agents. But what we do know from complex systems is that emergence often occurs at much larger scales than 10 or 20 agents. Perhaps we need a collection of thousands or more individual agents to interact meaningfully for complex “super organisms” to emerge.&lt;/p&gt;

&lt;p&gt;A few years back, a paper looked at taking advantage of hardware accelerators like GPUs to significantly scale up multi-agent reinforcement learning. In this work, called MAgent (&lt;a href=&quot;https://arxiv.org/abs/1712.00600&quot;&gt;Zheng et al., 2018&lt;/a&gt;), they proposed a framework that gets up to a million agents, albeit simple ones, to engage in various grid-world multi-agent environments; furthermore, one population of agents can be pitted against another in a collective self-play manner.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/magent.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;MAgent (&lt;a href=&quot;https://arxiv.org/abs/1712.00600&quot;&gt;Zheng et al., 2018&lt;/a&gt;)&lt;/strong&gt; is a set of environments where large numbers of pixel agents in a gridworld interact in battles or other competitive scenarios. Unlike most platforms that focus on RL research with a single agent or only few agents, their aim is to support RL research that scales up to millions of agents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hardware revolution brought about by deep learning lets us train truly large-scale collective behavior. In some of these experiments, they observe predator-prey loops and encirclement tactics emerging from truly large-scale multi-agent reinforcement learning. Such macro-level collective intelligence will probably not emerge from traditional small-scale multi-agent environments:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/magent_large.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I would like to note that this work is from 2018, and hardware acceleration has only progressed further since then. A recent demo from NVIDIA last year showcased a physics engine that can now handle thousands of agents acting in a realistic physics simulation, unlike the simple gridworld environments. I believe that in the future, we could see really interesting studies of emergent behavior using these newer technologies.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/nvidia.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Recent advances in GPU hardware enable realistic 3D simulation of thousands of robot models, such as the one shown in this figure by &lt;a href=&quot;https://arxiv.org/abs/2109.11978&quot;&gt;Rudin et al. 2021&lt;/a&gt;. Such advances open the door to large-scale 3D simulation of artificial agents that interact with each other and collectively develop intelligent behavior.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;meta-learning&quot;&gt;Meta-Learning&lt;/h2&gt;

&lt;p&gt;These increases in compute capabilities won’t stop at simulation. I’ll end with a discussion on how collective behavior is being applied to meta-learning. We can think of an artificial neural network as a collection of neurons and synapses, each of which can be modeled as an individual agent, and collectively, these agents all interact inside a system where the ability to &lt;em&gt;learn&lt;/em&gt; is an emergent property.&lt;/p&gt;

&lt;p&gt;Currently, our concept of an artificial neural network is simply weight matrices between nodes with a non-linear activation function. But with extra compute, we can also explore really interesting directions, simulating generalized versions of neural networks where perhaps every “neuron” is implemented as an identical recurrent neural network (which can in principle compute anything). I remember &lt;a href=&quot;https://twitter.com/hardmaru/status/1109986348545368064&quot;&gt;several neuroscience papers&lt;/a&gt; exploring this theme; see neuroscientist Mark Humphries’s excellent &lt;a href=&quot;https://medium.com/the-spike/your-cortex-contains-17-billion-computers-9034e42d34f2&quot;&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/neuron_as_neural_network.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Each “Neuron” is an Artificial Neural Network&lt;/strong&gt;. “If we think the brain is a computer, because it is like a neural network, then now we must admit that individual neurons are computers too. All 17 billion of them in your cortex; perhaps all 86 billion in your brain.” — &lt;a href=&quot;https://medium.com/the-spike/your-cortex-contains-17-billion-computers-9034e42d34f2&quot;&gt;Mark Humphries&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Rather than the neuron, though, we have recently seen some ambitious works modeling the &lt;em&gt;synapse&lt;/em&gt; as a recurrent neural network. When a standard neural network is trained, a forward pass propagates the inputs of the network to the output, and the backpropagation algorithm then propagates the error signals back from the output layer to the input layer, using gradients to adjust the weights. In principle, an RNN synapse can also learn something like the backpropagation rule, or perhaps something even better.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/vsmetaml.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Each Synapse is a Recurrent Neural Network&lt;/strong&gt;. Recent work by &lt;a href=&quot;https://arxiv.org/abs/2104.04657&quot;&gt;Sandler et al., 2021&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2012.14905&quot;&gt;Kirsch and Schmidhuber, 2020&lt;/a&gt; attempt to generalize the accepted notion of artificial neural networks, where each neuron can hold multiple states rather than a scalar value, and each synapse function bi-directionally to facilitate both learning and inference. In this figure, (&lt;a href=&quot;https://arxiv.org/abs/2109.10781&quot;&gt;Kirsch et al. 2021&lt;/a&gt;) use an identical recurrent neural network (RNN) (with different internal hidden states) to model each synapse, and show that the network can be trained by simply running the RNNs forward, without using backpropagation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So rather than relying on hand-coded forward and back propagation, we can model each synapse of a neural network with a recurrent neural network, which is a universal computer, and let it &lt;em&gt;learn&lt;/em&gt; how to best forward and back propagate signals: learning how to learn. The “hidden states” of each RNN would essentially define what the “weights” are, in a highly plastic way.&lt;/p&gt;
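To make this idea concrete, here is a toy numerical sketch (my own illustration, not the architecture of Sandler et al. or Kirsch et al.): every synapse runs the same tiny RNN cell, but each keeps its own hidden state, and the effective connection weight is read out from that state, so updating the hidden states changes how the network computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, d = 3, 2, 4              # 3 inputs, 2 outputs, hidden size 4

# Shared RNN-cell parameters (identical for every synapse).
W_h = rng.normal(0, 0.1, (d, d))
W_x = rng.normal(0, 0.1, (d,))
w_read = rng.normal(0, 0.1, (d,))

# Per-synapse hidden states: one state vector per (input, output) pair.
H = rng.normal(0, 0.1, (n_in, n_out, d))

def step(x, H):
    """One forward pass; each synapse also updates its own hidden state."""
    weights = H @ w_read              # read a scalar weight out of each state
    y = x @ weights                   # ordinary weighted sum per output unit
    # Every synapse runs the same cell on (its state, its presynaptic input).
    H_new = np.tanh(H @ W_h + x[:, None, None] * W_x)
    return y, H_new

x = np.array([1.0, -0.5, 0.2])
y, H2 = step(x, H)                    # H2 now encodes updated "weights"
```

Here the network's behavior is entirely determined by the hidden states, which change on every forward pass, rather than by a fixed weight matrix.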

&lt;p&gt;Recent works, &lt;a href=&quot;https://arxiv.org/abs/2104.04657&quot;&gt;Sandler et al., 2021&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2012.14905&quot;&gt;Kirsch and Schmidhuber, 2020&lt;/a&gt;, have shown that these approaches are a generalization of backpropagation. They can even experimentally train these meta-learning networks to exactly replicate the backpropagation operation and perform stochastic gradient descent. But more importantly, these networks can evolve learning rules that learn more efficiently than stochastic gradient descent, or even Adam.&lt;/p&gt;
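As a minimal, hypothetical illustration of what "a generalization of backpropagation" means (this parameterization is mine, not the one used in either paper), consider an update rule of the form delta_w = a*grad + b*w + c. Plain SGD with learning rate lr is the special case (a, b, c) = (-lr, 0, 0), while other parameter settings express different update rules:

```python
import numpy as np

def learned_update(w, grad, theta):
    """A hypothetical parametric update rule containing SGD as a special case."""
    a, b, c = theta
    return w + a * grad + b * w + c

w = np.array([0.5, -1.0])
g = np.array([0.2, 0.4])

# theta = (-lr, 0, 0) recovers plain SGD exactly...
sgd_step = w - 0.1 * g
rule_step = learned_update(w, g, (-0.1, 0.0, 0.0))

# ...while e.g. a nonzero b adds weight decay, a different rule entirely.
decayed = learned_update(w, g, (-0.1, -0.01, 0.0))
```

The RNN-synapse rules above are far richer than this three-parameter family, but the containment argument is the same: the learned rule's hypothesis space includes SGD, so meta-training can only match or improve on it (on the meta-training distribution).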

&lt;p&gt;In the following experiment, &lt;a href=&quot;https://arxiv.org/abs/2012.14905&quot;&gt;Kirsch and Schmidhuber, 2020&lt;/a&gt; trained this type of meta-learning system, called a variable shared meta learner (the blue line), to learn a learning rule using only the MNIST dataset. The learned rule outperforms the backprop SGD and Adam baselines, which is expected, since it is fine-tuned to MNIST. But when they test the learning rule on a new dataset, like Fashion-MNIST, they see similar performance gains:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/vsmetaml_result.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;These works are still in their early stages, but I think such approaches of modeling neural networks as a truly collective set of identical neurons or synapses, rather than as fixed unique weights, are a promising direction that could reshape the sub-field of meta-learning.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Neural network systems are highly complex. We may never be able to truly understand how they work at the level of simple idealized systems that can be explained (and predicted) with relatively simple physical laws. I believe that deep learning research can benefit from treating neural network systems, in their construction, training, and deployment, as complex systems. I hope this blog post is a useful survey of several ideas from complex systems that make neural network systems more robust and adaptive to changes in their environments.&lt;/p&gt;

&lt;p&gt;If you are interested in reading more, please check out our &lt;a href=&quot;https://journals.sagepub.com/doi/10.1177/26339137221114874&quot;&gt;paper&lt;/a&gt; published in &lt;a href=&quot;https://journals.sagepub.com/home/COL&quot;&gt;Collective Intelligence&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;citation&quot;&gt;Citation&lt;/h3&gt;

&lt;p&gt;If you find this blog post useful, please cite our paper as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{doi:10.1177/26339137221114874,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author = {David Ha and Yujin Tang},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title = {Collective intelligence for deep learning: A survey of recent developments},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal = {Collective Intelligence},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;volume = {1},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;number = {1},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year = {2022},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;doi = {10.1177/26339137221114874},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;URL = {https://doi.org/10.1177/26339137221114874},&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Oct 2022 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>EvoJAX: A Hardware-Accelerated Neuroevolution Toolkit</title>
        <link>https://blog.otoro.net/2022/02/10/evojax/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2022/02/10/evojax/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 50%;&quot;&gt;&lt;source src=&quot;/assets/20220210/evojax_waterworld.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;EvoJAX is a hardware-accelerated neuroevolution toolkit built on top of JAX. It can help run a wide range of evolution experiments within minutes on a TPU/GPU, compared to hours or days on CPU clusters.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://github.com/google/evojax/&quot;&gt;github.com/google/evojax/&lt;/a&gt;, where the repo resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://github.com/google/evojax/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Thu, 10 Feb 2022 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>Permutation-Invariant Neural Networks for Reinforcement Learning</title>
        <link>https://blog.otoro.net/2021/11/18/attentionneuron/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2021/11/18/attentionneuron/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/cover_orig.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;br /&gt;
&lt;i&gt;Reinforcement learning agents typically perform poorly if provided with inputs that were not clearly defined in training. A new approach enables RL agents to perform well, even when subject to corrupt, incomplete, or shuffled inputs.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Note: This blog post about our &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;paper&lt;/a&gt; is written by &lt;a href=&quot;https://twitter.com/yujin_tang&quot;&gt;Yujin Tang&lt;/a&gt; and myself, and was originally posted on &lt;a href=&quot;https://ai.googleblog.com/2021/11/permutation-invariant-neural-networks.html&quot;&gt;Google AI Blog&lt;/a&gt;. It has been cross-posted here for archival purposes.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;i&gt;“The brain is able to use information coming from the skin as if it were coming from the eyes. We don’t see with the eyes or hear with the ears, these are just the receptors, seeing and hearing in fact goes on in the brain.”&lt;/i&gt;&lt;/p&gt;

&lt;p style=&quot;text-align: right;&quot;&gt;— &lt;a href=&quot;https://en.wikipedia.org/wiki/Paul_Bach-y-Rita&quot;&gt;Paul Bach-y-Rita&lt;/a&gt; &lt;a href=&quot;https://en.wikipedia.org/wiki/Livewired_(book)&quot;&gt;¹&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People have the amazing ability to use one sensory modality (e.g., touch) to supply environmental information normally gathered by another sense (e.g., vision). This adaptive ability, called &lt;a href=&quot;https://en.wikipedia.org/wiki/Sensory_substitution&quot;&gt;sensory substitution&lt;/a&gt;, is a phenomenon well-known to neuroscience. While difficult adaptations — such as adjusting to seeing things &lt;a href=&quot;https://www.sciencedirect.com/science/article/abs/pii/S0010945217301314&quot;&gt;upside-down&lt;/a&gt;, learning to ride a &lt;a href=&quot;https://ed.ted.com/best_of_web/bf2mRAfC&quot;&gt;“backwards” bicycle&lt;/a&gt;, or learning to “see” by interpreting visual information emitted from a grid of electrodes placed on one’s tongue — require anywhere from weeks to months, or even years, to attain mastery, people are able to eventually adjust to sensory substitutions.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20211118/tongue.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/reverse_bicycle.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Examples of Sensory Substitution. &lt;strong&gt;Left&lt;/strong&gt;: “Tongue Display Unit” (&lt;a href=&quot;https://www.sciencedirect.com/science/article/abs/pii/S0006899301026671&quot;&gt;Maris and Bach-y-Rita, 2001&lt;/a&gt;; Image: &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S1026309811001702#f000020&quot;&gt;Kaczmarek, 2011&lt;/a&gt;). &lt;strong&gt;Right&lt;/strong&gt;: The “backwards brain bicycle” (&lt;a href=&quot;https://ed.ted.com/best_of_web/bf2mRAfC&quot;&gt;TED Talk&lt;/a&gt;, &lt;a href=&quot;https://gifs.com/gif/the-backwards-brain-bicycle-smarter-every-day-133-yEEQE8&quot;&gt;Figure&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In contrast, most neural networks are not able to adapt to sensory substitutions at all. For instance, most &lt;a href=&quot;https://en.wikipedia.org/wiki/Reinforcement_learning&quot;&gt;reinforcement learning&lt;/a&gt; (RL) agents require their inputs to be in a pre-specified format, or else they will fail. They expect fixed-size inputs and assume that each element of the input carries a precise meaning, such as the pixel intensity at a specified location, or state information, like position or velocity. In popular RL benchmark tasks (e.g., &lt;a href=&quot;https://pybullet.org/wordpress/&quot;&gt;Ant&lt;/a&gt; or &lt;a href=&quot;https://github.com/google/brain-tokyo-workshop/tree/master/learntopredict/cartpole&quot;&gt;Cart-pole&lt;/a&gt;), an agent trained using current &lt;a href=&quot;https://github.com/DLR-RM/stable-baselines3&quot;&gt;RL algorithms&lt;/a&gt; will fail if its sensory inputs are changed or if the agent is fed additional noisy inputs that are unrelated to the task at hand.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning&lt;/a&gt;, a &lt;a href=&quot;https://arxiv.org/abs/2109.02869&quot;&gt;spotlight paper&lt;/a&gt; at &lt;a href=&quot;https://neurips.cc/&quot;&gt;NeurIPS 2021&lt;/a&gt;, we explore permutation invariant neural network agents, which require each of their sensory neurons (receptors that receive sensory inputs from the environment) to figure out the meaning and context of its input signal, rather than explicitly assuming a fixed meaning. Our experiments show that such agents are robust to observations that contain additional redundant or noisy information, and to observations that are corrupt and incomplete.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/ants.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/cartpole.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Permutation invariant reinforcement learning agents adapting to sensory substitutions. &lt;strong&gt;Left&lt;/strong&gt;: The ordering of the ant’s 28 observations are randomly shuffled every 200 time-steps. Unlike the standard policy, our policy is not affected by the suddenly permuted inputs. &lt;strong&gt;Right&lt;/strong&gt;: Cart-pole agent given many redundant noisy inputs (Try interactive &lt;a href=&quot;https://attentionneuron.github.io/#cartpole_demo_special&quot;&gt;web-demo&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In addition to adapting to sensory substitutions in state-observation environments (like the ant and cart-pole examples), we show that these agents can also adapt to sensory substitutions in complex visual-observation environments (such as a &lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing&lt;/a&gt; game that uses only pixel observations) and can still perform even when the stream of input images is constantly being reshuffled:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_base.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_yosemite.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;We partition the visual input from CarRacing into a 2D grid of small patches, and shuffled their ordering (&lt;strong&gt;Left&lt;/strong&gt;). Without any additional training, our agent still performs even when the original training background is replaced with new images (&lt;strong&gt;Right&lt;/strong&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;method&quot;&gt;Method&lt;/h2&gt;

&lt;p&gt;Our approach takes observations from the environment at each time-step and feeds each element of the observation into distinct, but identical, neural networks (called “sensory neurons”), each with no fixed relationship to one another. Each sensory neuron integrates, over time, information from only its particular sensory input channel. Because each sensory neuron receives only a small part of the full picture, the neurons need to &lt;a href=&quot;https://en.wikipedia.org/wiki/Self-organization&quot;&gt;self-organize&lt;/a&gt; through communication in order for a global coherent behavior to &lt;a href=&quot;https://en.wikipedia.org/wiki/Emergence&quot;&gt;emerge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/20211118/schematic_input.png&quot; width=&quot;100%&quot; /&gt;
&lt;em&gt;&lt;strong&gt;Illustration of observation segmentation&lt;/strong&gt;. We segment each input into elements, which are then fed to independent sensory neurons. For non-vision tasks where the inputs are usually 1D vectors, each element is a scalar. For vision tasks, we crop each input image into non-overlapping patches.&lt;/em&gt;&lt;/p&gt;
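The segmentation step can be sketched as follows (the sizes here are assumptions for illustration: a 5-dimensional state vector, and a 96x96 RGB frame split into 6x6 patches):

```python
import numpy as np

# Non-vision task: each scalar of the state vector goes to its own neuron.
state = np.arange(5.0)                # e.g. a 5-dim cart-pole observation
elements = list(state)                # one scalar per sensory neuron

# Vision task: crop the frame into non-overlapping PxP patches.
img = np.zeros((96, 96, 3))           # assumed CarRacing-sized RGB frame
P = 6                                 # assumed patch size
patches = (img.reshape(96 // P, P, 96 // P, P, 3)
              .swapaxes(1, 2)         # group patch rows/cols together
              .reshape(-1, P, P, 3))  # flat list of (6, 6, 3) patches
```

Each element of `elements` (or each patch in `patches`) is then fed to one copy of the shared sensory-neuron network.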

&lt;p&gt;We encourage neurons to communicate with each other by training them to broadcast messages. While receiving information locally, each individual sensory neuron also continually broadcasts an output message at each time-step. These messages are consolidated and combined into an output vector, called the global latent code, using an attention mechanism similar to that applied in the &lt;a href=&quot;https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html&quot;&gt;Transformer&lt;/a&gt; architecture. A policy network then uses the global latent code to produce the action that the agent will use to interact with the environment. This action is also fed back into each sensory neuron in the next time-step, closing the communication loop.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/20211118/schematic_main.png&quot; width=&quot;100%&quot; /&gt;
&lt;em&gt;&lt;strong&gt;Overview of the permutation-invariant RL method&lt;/strong&gt;. We first feed each individual observation (o&lt;sub&gt;t&lt;/sub&gt;) into a particular sensory neuron (along with the agent’s previous action, a&lt;sub&gt;t-1&lt;/sub&gt;). Each neuron then produces and broadcasts a message independently, and an attention mechanism summarizes them into a global latent code (m&lt;sub&gt;t&lt;/sub&gt;) that is given to the agent’s downstream policy network (𝜋) to produce the agent’s action a&lt;sub&gt;t&lt;/sub&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Why is this system permutation invariant? Each sensory neuron is an identical neural network that is not confined to only process information from one particular sensory input. In fact, in our setup, the inputs to each sensory neuron are not defined. Instead, each neuron must figure out the meaning of its input signal by paying attention to the inputs received by the other sensory neurons, rather than explicitly assuming a fixed meaning. This encourages the agent to process the entire input as an &lt;a href=&quot;https://arxiv.org/abs/1810.00825&quot;&gt;unordered set&lt;/a&gt;, making the system permutation invariant to its input.&lt;/p&gt;

&lt;p&gt;The particular form of attention we used has been shown to work with &lt;a href=&quot;https://arxiv.org/abs/1810.00825&quot;&gt;unordered sets&lt;/a&gt;. Since our system treats the input as an unordered set, rather than an ordered list, the output will not be affected by the ordering of the sensory neurons (and by extension the ordering of the observations), thus attaining permutation invariance (our &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;paper&lt;/a&gt; includes an intuitive explanation of the permutation invariance of attention for interested readers looking to dive deeper). By processing the input as an unordered set, rather than a fixed-size list, the agent can use as many sensory neurons as required, thus enabling it to process observations of arbitrary length. Both of these properties will help the agent adapt to sensory substitutions.&lt;/p&gt;
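This invariance can be checked numerically. The sketch below (toy dimensions; it omits the per-neuron recurrence and action feedback described above, so it is not the paper's exact layer) projects each scalar observation with the same key and value maps, pools the resulting messages with a fixed set of learned queries, and verifies that shuffling the observations leaves the pooled latent code unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, d_msg, n_q = 8, 16, 4          # assumed toy sizes

W_k = rng.normal(size=(1, d_msg))     # shared key projection per element
W_v = rng.normal(size=(1, d_msg))     # shared value projection per element
Q = rng.normal(size=(n_q, d_msg))     # fixed learned queries (give the output its order)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_latent(obs):
    K = obs[:, None] @ W_k            # (n_obs, d_msg): one key per element
    V = obs[:, None] @ W_v            # (n_obs, d_msg): one value per element
    A = softmax(Q @ K.T / np.sqrt(d_msg), axis=-1)   # attend over the set
    return A @ V                      # (n_q, d_msg) global latent code

obs = rng.normal(size=n_obs)
perm = rng.permutation(n_obs)
# Shuffling permutes keys and values together, so the weighted sum is unchanged.
assert np.allclose(global_latent(obs), global_latent(obs[perm]))
```

The key point is that the output's order comes from the fixed queries, not from the input, so the summation over the set of messages erases the input ordering.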

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;We demonstrate the robustness and flexibility of this approach in simpler, state-observation environments, where the observations the agent receives as inputs are low-dimensional vectors holding information about the agent’s states, such as the position or velocity of its components. The agent in the popular &lt;a href=&quot;https://pybullet.org/wordpress/&quot;&gt;Ant&lt;/a&gt; locomotion task has a total of 28 inputs with information that includes positions and velocities. We shuffle the order of the input vector several times during a trial and show that the agent is rapidly able to adapt and is still able to walk forward.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://github.com/google/brain-tokyo-workshop/tree/master/learntopredict/cartpole&quot;&gt;cart-pole&lt;/a&gt;, the agent’s goal is to swing up a pole mounted at the center of the cart and balance it upright. Normally the agent sees only five inputs, but we modify the cart-pole environment to provide 15 shuffled input signals, 10 of which are pure noise, and the remainder of which are the actual observations from the environment. The agent is still able to perform the task, demonstrating the system’s capacity to work with a large number of inputs and attend only to channels it deems useful. Such flexibility may find useful applications for processing a large unspecified number of signals, most of which are noise, from ill-defined systems.&lt;/p&gt;

&lt;p&gt;We also apply this approach to high-dimensional vision-based environments where the observation is a stream of pixel images. Here, we investigate screen-shuffled versions of vision-based RL environments, where each observation frame is divided into a grid of patches, and like a puzzle, the agent must process the patches in a shuffled order to determine a course of action to take. To demonstrate our approach on vision-based tasks, we created a shuffled version of Atari Pong.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/pong_occluded.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/pong_base.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Shuffled Pong results&lt;/strong&gt;. &lt;strong&gt;Left&lt;/strong&gt;: Pong agent trained to play using only 30% of the patches matches performance of Atari opponent. &lt;strong&gt;Right&lt;/strong&gt;: Without extra training, when we give the agent more puzzle pieces, its performance increases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here the agent’s input is a variable-length list of patches, so unlike typical RL agents, the agent only gets to “see” a subset of patches from the screen. In the puzzle pong experiment, we pass to the agent a random sample of patches across the screen, which are then fixed through the remainder of the game. We find that we can discard 70% of the patches (at these fixed-random locations) and still train the agent to perform well against the built-in Atari opponent. Interestingly, if we then reveal additional information to the agent (e.g., allowing it access to more image patches), its performance increases, even without additional training. When the agent receives all the patches, in shuffled order, it wins 100% of the time, achieving the same result as agents that are trained while seeing the entire screen.&lt;/p&gt;

&lt;p&gt;We find that imposing additional difficulty during training by using unordered observations has additional benefits, such as improving generalization to unseen variations of the task, like when the background of the &lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing&lt;/a&gt; training environment is replaced with a novel image. To understand why the agent is capable of generalizing to new backgrounds, we visualize the patches of the (shuffled) screen to which the agent was paying attention. We find that the absence of fixed structure in the observations seems to encourage the agent to learn the essential structures in the environment (e.g., road edges) to best perform its task. We see that these attention attributes also transfer over to test environments, helping the agent generalize its policy to new backgrounds.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_attention.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_kyoto.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Shuffled CarRacing results. The agent has learned to focus its attention (indicated by the highlighted patches) on the road boundaries. &lt;strong&gt;Left&lt;/strong&gt;: Training environment. &lt;strong&gt;Right&lt;/strong&gt;: Test environment with new background.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The permutation invariant neural network agents presented here can handle ill-defined, varying observation spaces. Our agents are robust to observations that contain redundant or noisy information, or observations that are corrupt and incomplete. We believe that permutation invariant systems open up numerous possibilities in reinforcement learning.&lt;/p&gt;

&lt;p&gt;If you’re interested to learn more about this work, we invite readers to read our &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;interactive article&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2109.02869&quot;&gt;pdf&lt;/a&gt; version) or watch our &lt;a href=&quot;https://youtu.be/7nTlXhx0CZI&quot;&gt;video&lt;/a&gt;. We also released &lt;a href=&quot;https://github.com/google/brain-tokyo-workshop&quot;&gt;code&lt;/a&gt; to reproduce our experiments.&lt;/p&gt;
</description>
        <pubDate>Thu, 18 Nov 2021 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>Modern Evolution Strategies for Creativity&amp;#x3a;&lt;br /&gt;Fitting Concrete Images and Abstract Concepts</title>
        <link>https://blog.otoro.net/2021/9/21/esclip/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2021/9/21/esclip/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 60%;&quot;&gt;&lt;source src=&quot;/assets/20210921/esclip.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;i&gt;“A drawing of a cat”&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;CLIP + ES + Triangles&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://es-clip.github.io/&quot;&gt;es-clip.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://es-clip.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Tue, 21 Sep 2021 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Neuroevolution of Self-Interpretable Agents</title>
        <link>https://blog.otoro.net/2020/3/18/attention/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2020/3/18/attention/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20200318/carracing_doom_stages.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Agents with a self-attention “bottleneck” not only can solve these tasks from pixel inputs with only 4000 parameters, but they are also better at generalization.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://attentionagent.github.io/&quot;&gt;attentionagent.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://attentionagent.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Wed, 18 Mar 2020 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Learning to Predict Without Looking Ahead</title>
        <link>https://blog.otoro.net/2019/10/29/learning/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2019/10/29/learning/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20191029/learncartpole5.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Rather than hardcoding forward prediction, we try to get agents to learn that they need to predict the future.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://learningtopredict.github.io/&quot;&gt;learningtopredict.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://learningtopredict.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Tue, 29 Oct 2019 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Weight Agnostic Neural Networks</title>
        <link>https://blog.otoro.net/2019/6/12/wann/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2019/6/12/wann/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20190612/wann_cover.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;We search for neural network architectures that can already perform various tasks even when they use random weight values.&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://weightagnostic.github.io/&quot;&gt;weightagnostic.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://weightagnostic.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Wed, 12 Jun 2019 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Learning Latent Dynamics for Planning from Pixels</title>
        <link>https://blog.otoro.net/2019/2/15/planet/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2019/2/15/planet/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;https://planetrl.github.io/assets/mp4/planet_intro.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;PlaNet learns a world model from image inputs only and successfully leverages it for planning in latent space.&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://planetrl.github.io/&quot;&gt;planetrl.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://planetrl.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Fri, 15 Feb 2019 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>Reinforcement Learning for Improving Agent Design</title>
        <link>https://blog.otoro.net/2018/10/10/design-rl/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2018/10/10/design-rl/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;https://storage.googleapis.com/quickdraw-models/sketchRNN/designrl/augmentbipedsmalllegs.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Little dude rewarded for having little legs.&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://designrl.github.io/&quot;&gt;designrl.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://designrl.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Wed, 10 Oct 2018 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>World Models Experiments</title>
        <link>https://blog.otoro.net/2018/06/09/world-models-experiments/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2018/06/09/world-models-experiments/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20180609/worldmodels_experiments_small.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;br /&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/hardmaru/WorldModelsExperiments&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this article I will give step-by-step instructions for reproducing the experiments in the &lt;a href=&quot;https://worldmodels.github.io&quot;&gt;World Models&lt;/a&gt; article (&lt;a href=&quot;https://arxiv.org/abs/1803.10122&quot;&gt;pdf&lt;/a&gt;). The reference TensorFlow implementation is on &lt;a href=&quot;https://github.com/hardmaru/WorldModelsExperiments&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Other people have implemented World Models independently. There is an implementation in &lt;a href=&quot;https://medium.com/applied-data-science/how-to-build-your-own-world-model-using-python-and-keras-64fb388ba459&quot;&gt;Keras&lt;/a&gt; that reproduces part of the CarRacing-v0 experiment. There is also another project in &lt;a href=&quot;https://dylandjian.github.io/world-models/&quot;&gt;PyTorch&lt;/a&gt; that attempts to apply this model on &lt;a href=&quot;https://blog.openai.com/retro-contest/&quot;&gt;OpenAI Retro Sonic&lt;/a&gt; environments.&lt;/p&gt;

&lt;p&gt;For general discussion about the World Models article, there are already some good discussion threads on the GitHub &lt;a href=&quot;https://github.com/worldmodels/worldmodels.github.io/issues&quot;&gt;issues&lt;/a&gt; page of the interactive article. If you have any issues specific to the code, please don’t hesitate to raise an &lt;a href=&quot;https://github.com/hardmaru/WorldModelsExperiments/issues&quot;&gt;issue&lt;/a&gt; to discuss.&lt;/p&gt;

&lt;h1 id=&quot;pre-requisite-reading&quot;&gt;Pre-requisite reading&lt;/h1&gt;

&lt;p&gt;I recommend reading the following articles to gain some background knowledge before attempting to reproduce the experiments.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://worldmodels.github.io/&quot;&gt;World Models&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/1803.10122&quot;&gt;pdf&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2017/10/29/visual-evolution-strategies/&quot;&gt;A Visual Guide to Evolution Strategies&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2017/11/12/evolving-stable-strategies/&quot;&gt;Evolving Stable Strategies&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;below-is-optional&quot;&gt;&lt;em&gt;Below is optional&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2015/06/14/mixture-density-networks/&quot;&gt;Mixture Density Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2015/11/24/mixture-density-networks-with-tensorflow/&quot;&gt;Mixture Density Networks with TensorFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read tutorials on Variational Autoencoders if you are not familiar with them. Some Examples:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jmetzen.github.io/2015-11-27/vae.html&quot;&gt;Variational Autoencoder in TensorFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.keras.io/building-autoencoders-in-keras.html&quot;&gt;Building Autoencoders in Keras&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2016/04/01/generating-large-images-from-latent-vectors/&quot;&gt;Generating Large Images from Latent Vectors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Be familiar with RNNs for continuous sequence generation:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1308.0850&quot;&gt;Generating Sequences With Recurrent Neural Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2015/12/12/handwriting-generation-demo-in-tensorflow/&quot;&gt;Handwriting Generation Demo in TensorFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2017/01/01/recurrent-neural-network-artist/&quot;&gt;Recurrent Neural Network Tutorial for Artists&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;software-settings&quot;&gt;Software Settings&lt;/h1&gt;

&lt;p&gt;I have tested the code with the following settings:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Ubuntu 16.04&lt;/li&gt;
  &lt;li&gt;Python 3.5.4&lt;/li&gt;
  &lt;li&gt;TensorFlow 1.8.0&lt;/li&gt;
  &lt;li&gt;NumPy 1.13.3&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/ppaquette/gym-doom&quot;&gt;VizDoom Gym Levels&lt;/a&gt; &lt;code&gt;(Latest commit 60ff576 on Mar 18, 2017)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;OpenAI Gym 0.9.4 (&lt;strong&gt;Note: Gym 1.0+ breaks this experiment. Only tested for 0.9.x&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;cma 2.2.0&lt;/li&gt;
  &lt;li&gt;mpi4py 2, see &lt;a href=&quot;https://github.com/hardmaru/estool&quot;&gt;estool&lt;/a&gt;, which we have forked for this project.&lt;/li&gt;
  &lt;li&gt;Jupyter Notebook for model testing, and tracking progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used OS X for inference, but trained the models using Google Cloud VMs. I trained the V and M models on a P100 GPU instance, and trained the controller C on a pure CPU instance with 64 CPU cores (&lt;a href=&quot;https://cloud.google.com/compute/pricing&quot;&gt;n1-standard-64&lt;/a&gt;) using CMA-ES. I will outline which parts of the training require GPUs and which parts use only CPUs, and try to keep your costs low for running this experiment.&lt;/p&gt;

&lt;h1 id=&quot;instructions-for-running-pre-trained-models&quot;&gt;Instructions for running pre-trained models&lt;/h1&gt;

&lt;p&gt;To reproduce the results with the pre-trained models provided in the repo, you only need to clone it onto a desktop computer running in CPU mode. No Cloud VM or GPUs necessary.&lt;/p&gt;

&lt;h2 id=&quot;carracing-v0&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;If you are using a MacBook Pro, I recommend setting the resolution to “More Space”, since the CarRacing-v0 environment renders at a larger resolution and doesn’t fit in the default screen settings.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/macbook_resolution.jpeg&quot; width=&quot;75%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In the command line, go into the &lt;code class=&quot;highlighter-rouge&quot;&gt;carracing&lt;/code&gt; subdirectory. To play the game yourself, run &lt;code class=&quot;highlighter-rouge&quot;&gt;python env.py&lt;/code&gt; in a terminal. You can control the car using the four arrow keys: up/down to accelerate/brake, and left/right to steer.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/carracing_human_play.png&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this environment, a new random track is generated for each run. While I can consistently get above 800 if I drive very carefully, it is hard for me to consistently score above 900 points. Some Stanford &lt;a href=&quot;https://twitter.com/hardmaru/status/934872621077839872&quot;&gt;students&lt;/a&gt; also found it tough to get consistently higher than 900. To solve this environment, an agent must obtain an average score of 900 over 100 consecutive random trials.&lt;/p&gt;

&lt;p&gt;To run the pre-trained model once and see the agent in full-rendered mode, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py render log/carracing.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Run the pre-trained model 100 times in &lt;code class=&quot;highlighter-rouge&quot;&gt;no-render&lt;/code&gt; mode (even in &lt;code class=&quot;highlighter-rouge&quot;&gt;no-render&lt;/code&gt; mode, a simpler display is still rendered on screen, since this environment needs OpenGL to extract the pixel observations):&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py norender log/carracing.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command will output the score for each of the 100 trials, and after all 100 runs it will also output the average score and standard deviation. The average score should be above 900.&lt;/p&gt;
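&lt;p&gt;For reference, the summary statistics reported at the end can be reproduced from the per-trial scores with a few lines of Python. The scores below are made-up placeholders, not actual results:&lt;/p&gt;

```python
# Summarize per-trial scores the way the evaluation run does: mean and
# standard deviation over all trials. The scores here are placeholders;
# a real run would produce 100 of them.
import statistics

scores = [905.3, 887.1, 912.8, 899.4]  # in practice, 100 trial scores

avg = statistics.mean(scores)
std = statistics.pstdev(scores)  # population std-dev over the trials
print(f"avg score: {avg:.2f}, std: {std:.2f}")
```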

&lt;p&gt;To run the pre-trained controller inside of an environment generated using M and visualized using V:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python dream_model.py log/carracing.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/carracing_dream.png&quot; width=&quot;50%&quot; /&gt;
&lt;/center&gt;

&lt;h2 id=&quot;doomtakecover-v0&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/DoomTakeCover-v0/&quot;&gt;DoomTakeCover-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In the &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; directory, run &lt;code class=&quot;highlighter-rouge&quot;&gt;python doomrnn.py&lt;/code&gt; to play inside of an environment generated by M.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn_dream_env.png&quot; width=&quot;50%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;You can hit left, down, or right to play inside of this environment. To visualize the pre-trained model playing inside of the real environment, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomreal render log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn_actual.png&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;Note that this environment is modified to also display the cropped 64x64px frames, in addition to the reconstructed frames and actual frames of the game. To run the model inside the actual environment 100 times and compute the mean score, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomreal norender log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should get a mean score of over 900 time-steps over 100 random episodes. The above two commands also work with &lt;code class=&quot;highlighter-rouge&quot;&gt;doomreal&lt;/code&gt; replaced by &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt;, if you want the statistics of the agent playing inside of the generated environment. If you wish to change the temperature of the generated environment, modify the constant &lt;code class=&quot;highlighter-rouge&quot;&gt;TEMPERATURE&lt;/code&gt; inside &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn.py&lt;/code&gt;, which is currently set to 1.25.&lt;/p&gt;
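&lt;p&gt;To give a feel for what the temperature does, here is a minimal, self-contained sketch of temperature-controlled sampling from a mixture of Gaussians, the mechanism an MDN-RNN’s sampling step uses. The function and its arguments are illustrative, not the actual code in &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn.py&lt;/code&gt;:&lt;/p&gt;

```python
# Illustrative temperature-controlled sampling from a Gaussian mixture
# (the mechanism the TEMPERATURE constant tunes); not the repo's code.
import math
import random

def sample_with_temperature(logits, mu, sigma, temperature=1.25, rng=random):
    # Higher temperature flattens the mixture weights and widens each
    # Gaussian, producing a more "uncertain" generated environment.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(l - m) for l in scaled]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(mu[k], sigma[k] * math.sqrt(temperature))
```

&lt;p&gt;At very low temperature, samples collapse toward the mean of the most likely mixture component; at high temperature, rarely-used components get picked more often and each sample is noisier.&lt;/p&gt;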

&lt;p&gt;To visualize the model playing inside of the generated environment, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomrnn render log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn_dream_agent.png&quot; width=&quot;50%&quot; /&gt;
&lt;/center&gt;

&lt;h1 id=&quot;instructions-for-training-everything-from-scratch&quot;&gt;Instructions for training everything from scratch&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;DoomTakeCover-v0&lt;/code&gt; experiment should take less than 24 hours to completely reproduce from scratch using a P100 instance and a 64-core CPU instance on &lt;a href=&quot;https://cloud.google.com/&quot;&gt;Google Cloud Platform&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;doomtakecover-v0-1&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/DoomTakeCover-v0/&quot;&gt;DoomTakeCover-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;I will discuss the VizDoom experiment first since it requires less compute time to reproduce from scratch. Since you may update the models in the repo, I recommend forking the repo and cloning/updating from your fork. I recommend running any command inside a &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; session so that you can close your ssh connection and the jobs will keep running in the background.&lt;/p&gt;

&lt;p&gt;I first create a 64-core CPU instance with ~ 200GB storage and 220GB RAM, and clone the repo in that instance. In the &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; directory, there is a script called &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.py&lt;/code&gt; that will extract 200 episodes from a random policy, and save the episodes as &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files in &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn/record&lt;/code&gt;. A bash script called &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.bash&lt;/code&gt; will run &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.py&lt;/code&gt; 64 times (~ one job per CPU core), so by running &lt;code class=&quot;highlighter-rouge&quot;&gt;bash extract.bash&lt;/code&gt;, we will generate 12,800 &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files in &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn/record&lt;/code&gt;. Some instances might randomly fail, so we generate a bit of extra data, although in the end we only use 10,000 episodes for training V and M. This process will take a few hours (probably less than 5 hours).&lt;/p&gt;
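&lt;p&gt;The fan-out arithmetic is simple: 64 jobs × 200 episodes = 12,800 episodes. A rough Python analogue of what &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.bash&lt;/code&gt; does follows; the bash script actually launches 64 OS processes, and the job body below is a placeholder, so this only illustrates the parallel fan-out:&lt;/p&gt;

```python
# Rough analogue of extract.bash: launch 64 extraction jobs in parallel,
# each recording 200 random-policy episodes. extract.bash uses separate
# OS processes; a thread pool is used here only to illustrate the fan-out.
from concurrent.futures import ThreadPoolExecutor

NUM_JOBS = 64           # ~ one job per CPU core
EPISODES_PER_JOB = 200  # each run of extract.py records 200 episodes

def extract_job(job_id):
    # Placeholder for `python extract.py`, which would save .npz episode
    # files into doomrnn/record.
    return EPISODES_PER_JOB

with ThreadPoolExecutor(max_workers=NUM_JOBS) as pool:
    totals = list(pool.map(extract_job, range(NUM_JOBS)))

print(sum(totals))  # 12800 episodes in total
```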

&lt;p&gt;After the &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files have been created in the &lt;code class=&quot;highlighter-rouge&quot;&gt;record&lt;/code&gt; subdirectory, I create a P100 GPU instance with ~ 200GB storage and 220GB RAM, and clone the repo there too. I use the ssh copy command, &lt;code class=&quot;highlighter-rouge&quot;&gt;scp&lt;/code&gt;, to copy all of the &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files from the CPU instance to the GPU instance, into the same &lt;code class=&quot;highlighter-rouge&quot;&gt;record&lt;/code&gt; subdirectory. You can use the &lt;code class=&quot;highlighter-rouge&quot;&gt;gcloud&lt;/code&gt; tool if &lt;code class=&quot;highlighter-rouge&quot;&gt;scp&lt;/code&gt; doesn’t work. This should be really fast, like less than a minute, if both instances are in the same region. Shut down the CPU instance after you have copied the &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files over to the GPU machine.&lt;/p&gt;

&lt;p&gt;On the GPU machine, run the command &lt;code class=&quot;highlighter-rouge&quot;&gt;bash gpu_jobs.bash&lt;/code&gt; to train the VAE, pre-process the recorded dataset, and train the MDN-RNN.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;gpu_jobs.bash&lt;/code&gt; script will run three jobs in sequential order:&lt;/p&gt;

&lt;p&gt;1) &lt;code class=&quot;highlighter-rouge&quot;&gt;python vae_train.py&lt;/code&gt; trains the VAE; after training, the model is saved in &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_vae/vae.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;2) Next, it pre-processes the collected data using the pre-trained VAE by launching &lt;code class=&quot;highlighter-rouge&quot;&gt;python series.py&lt;/code&gt;. A new dataset will be created in a subdirectory called &lt;code class=&quot;highlighter-rouge&quot;&gt;series&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;3) Once the &lt;code class=&quot;highlighter-rouge&quot;&gt;series.npz&lt;/code&gt; dataset is saved there, the script launches the MDN-RNN trainer with the command &lt;code class=&quot;highlighter-rouge&quot;&gt;python rnn_train.py&lt;/code&gt;. This produces a model in &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_rnn/rnn.json&lt;/code&gt; and also &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_initial_z/initial_z.json&lt;/code&gt;. The file &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z.json&lt;/code&gt; stores the initial latent variables (z) of an episode, which are needed when we generate the environment. This entire process might take 6-8 hours.&lt;/p&gt;
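&lt;p&gt;Conceptually, &lt;code class=&quot;highlighter-rouge&quot;&gt;gpu_jobs.bash&lt;/code&gt; is just a fail-fast sequential runner: each stage starts only if the previous one succeeded. The helper below is an illustrative Python sketch of that behavior, not code from the repo:&lt;/p&gt;

```python
# Fail-fast sequential runner, mirroring what gpu_jobs.bash does: each
# stage starts only if the previous one succeeded. Illustrative helper,
# not code from the repo.
import subprocess

def run_pipeline(commands):
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure

# gpu_jobs.bash effectively performs:
# run_pipeline([
#     ["python", "vae_train.py"],  # 1) train VAE -> tf_vae/vae.json
#     ["python", "series.py"],     # 2) pre-process data -> series/
#     ["python", "rnn_train.py"],  # 3) train MDN-RNN -> tf_rnn/rnn.json
# ])
```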

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doom_vae_test.png&quot; width=&quot;50%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;The notebook &lt;code&gt;vae_test.ipynb&lt;/code&gt; will visualize input/reconstruction images using your VAE on the training dataset.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;After V and M are trained, and you have the 3 new &lt;code class=&quot;highlighter-rouge&quot;&gt;json&lt;/code&gt; files, you must now copy &lt;code class=&quot;highlighter-rouge&quot;&gt;vae.json&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z.json&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn.json&lt;/code&gt; over to the &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_models&lt;/code&gt; subdirectory and overwrite any previous files there. You should update your git repo with these new models using &lt;code class=&quot;highlighter-rouge&quot;&gt;git add doomrnn/tf_models/*.json&lt;/code&gt; and commit the change to your fork. After you have done this, you can shut down the GPU machine. Then start the 64-core CPU instance again and log back into that machine.&lt;/p&gt;

&lt;p&gt;Now on the 64-core CPU instance, run the CMA-ES based training by launching the command &lt;code class=&quot;highlighter-rouge&quot;&gt;python train.py&lt;/code&gt; inside the &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; directory. This will launch the evolution trainer, which continues training until you &lt;code class=&quot;highlighter-rouge&quot;&gt;Ctrl-C&lt;/code&gt; the job. The controller C will be trained inside of M’s generated environment with a temperature of 1.25. You can monitor progress using the &lt;code class=&quot;highlighter-rouge&quot;&gt;plot_training_progress.ipynb&lt;/code&gt; notebook, which loads the &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt; files being generated. 200 generations (around 4-5 hours) should be enough to get decent results, and you can stop the job then. I left my job running for close to 1800 generations, although it doesn’t really add much value after 200 generations, so I prefer not to waste your money. Add all of the files inside &lt;code class=&quot;highlighter-rouge&quot;&gt;log/*.json&lt;/code&gt; into your forked repo and then shut down the instance.&lt;/p&gt;
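&lt;p&gt;If evolution strategies are unfamiliar, the articles linked in the pre-requisite reading cover them properly; the toy hill-climber below only conveys the general flavor of population-based search (sample perturbations, evaluate, move toward the best). It is deliberately much simpler than CMA-ES, which also adapts its sampling covariance:&lt;/p&gt;

```python
# Toy population-based search, in the same spirit as (but much simpler
# than) the CMA-ES trainer: sample perturbations around the current
# parameters, evaluate each, and step toward the best one.
import random

def toy_es(fitness, dim, iterations=300, pop=32, sigma=0.5, lr=0.3, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(iterations):
        best_eps, best_fit = None, float("-inf")
        for _ in range(pop):
            eps = [rng.gauss(0, sigma) for _ in range(dim)]
            cand = [t + e for t, e in zip(theta, eps)]
            f = fitness(cand)
            if f > best_fit:
                best_fit, best_eps = f, eps
        theta = [t + lr * e for t, e in zip(theta, best_eps)]
    return theta

# e.g. maximizing -(x - 3)^2 drives theta toward 3.0
```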

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn.cma.16.64.wall.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/20180609/doomrnn.cma.16.64.svg&quot; width=&quot;100%&quot; /&gt;
&lt;i&gt;Training DoomRNN using CMA-ES. Recording C's performance inside of the generated environment.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Using your desktop instance, and pulling your forked repo again, you can now run the following to test your newly trained V, M, and C models.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomreal render log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can replace &lt;code class=&quot;highlighter-rouge&quot;&gt;doomreal&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; to try the generated environment, or &lt;code class=&quot;highlighter-rouge&quot;&gt;render&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;norender&lt;/code&gt; to run your agent 100 times.&lt;/p&gt;

&lt;h2 id=&quot;carracing-v0-1&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The process for CarRacing-v0 is almost the same as the VizDoom example earlier, so I will discuss the differences in this section.&lt;/p&gt;

&lt;p&gt;Since this environment is built using OpenGL, it relies on a graphics output even in the gym environment’s &lt;code class=&quot;highlighter-rouge&quot;&gt;no-render&lt;/code&gt; mode, so on a Cloud VM box, I had to wrap the command with a headless X server. You can see that inside the &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.bash&lt;/code&gt; file in the &lt;code class=&quot;highlighter-rouge&quot;&gt;carracing&lt;/code&gt; directory, I run &lt;code class=&quot;highlighter-rouge&quot;&gt;xvfb-run -a -s &quot;-screen 0 1400x900x24 +extension RANDR&quot;&lt;/code&gt; before the real command. Other than this, the procedure for collecting data and training the V and M models is the same as VizDoom.&lt;/p&gt;

&lt;p&gt;Please note that after you train your VAE and MDN-RNN models, you must copy &lt;code class=&quot;highlighter-rouge&quot;&gt;vae.json&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z.json&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn.json&lt;/code&gt; over to the &lt;code class=&quot;highlighter-rouge&quot;&gt;vae&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn&lt;/code&gt; directories respectively (not &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_models&lt;/code&gt; as in DoomRNN), overwriting any previous files, and then update the forked repo as usual.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/car_vae_test.png&quot; width=&quot;50%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;&lt;code&gt;vae_test.ipynb&lt;/code&gt; used to examine the VAE trained on &lt;code&gt;CarRacing-v0&lt;/code&gt;'s extracted data.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this environment, we use the V and M models as model predictive control (MPC) and train the controller C on the actual environment, rather than inside of the generated environment. So rather than running &lt;code class=&quot;highlighter-rouge&quot;&gt;python train.py&lt;/code&gt;, you need to run &lt;code class=&quot;highlighter-rouge&quot;&gt;gce_train.bash&lt;/code&gt;, which uses headless X sessions to run the CMA-ES trainer. Because we train in the actual environment, training is slower compared to DoomRNN. By running the training inside a &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; session, you can monitor progress using the &lt;code class=&quot;highlighter-rouge&quot;&gt;plot_training_progress.ipynb&lt;/code&gt; notebook, which loads the &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt; files being generated, by running Jupyter in another &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; session in parallel.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/carracing.cma.16.64.wall.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/20180609/carracing.cma.16.64.svg&quot; width=&quot;100%&quot; /&gt;
&lt;i&gt;Training CarRacing-v0 using CMA-ES. Recording C's performance inside of the actual environment.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;150-200 generations (around 3 days) should be enough to reach a mean score of ~ 880, which is pretty close to the required score of 900. If you don’t have a lot of money or credits to burn, I recommend stopping once you are satisfied with a score of 850+ (which takes around a day of training). Qualitatively, a score of ~ 850-870 is not that much worse compared to our final agent that achieves 900+, and I don’t want to burn your hard-earned money on cloud credits. To get 900+ it might take weeks (who said getting SOTA was easy? :). The final models are saved in &lt;code class=&quot;highlighter-rouge&quot;&gt;log/*.json&lt;/code&gt; and you can test and view them the usual way.&lt;/p&gt;

&lt;h1 id=&quot;contributing&quot;&gt;Contributing&lt;/h1&gt;

&lt;p&gt;There are many cool ideas to try out: for instance, iterative training methods, transfer learning, intrinsic motivation, and other environments.&lt;/p&gt;

&lt;center&gt;
&lt;video autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 80%;&quot;&gt;&lt;source src=&quot;/assets/20180609/generative_pixel_pendulum.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;i&gt;A generative noisy pixel pendulum environment?&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;If you want to extend the code and try out new things, I recommend modifying the code to solve a specific new environment, rather than trying to improve the code to work for multiple environments at the same time. I find that for research work, and when trying to solve difficult environments, specific custom modifications are usually required. You are welcome to submit a pull request with a self-contained subdirectory tailored to a specific challenging environment you have attempted to solve, with instructions in a &lt;code class=&quot;highlighter-rouge&quot;&gt;README.md&lt;/code&gt; file in your subdirectory.&lt;/p&gt;

&lt;h1 id=&quot;citation&quot;&gt;Citation&lt;/h1&gt;

&lt;p&gt;If you found this code useful in an academic setting, please cite:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@incollection{ha2018worldmodels,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title = {Recurrent World Models Facilitate Policy Evolution},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author = {Ha, David and Schmidhuber, J{\&quot;u}rgen},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;booktitle = {Advances in Neural Information Processing Systems 31},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;pages = {2451--2463},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year = {2018},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;publisher = {Curran Associates, Inc.},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;url = {https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;note = &quot;\url{https://worldmodels.github.io}&quot;,&lt;br /&gt;
}&lt;br /&gt;
&lt;/code&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 09 Jun 2018 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>World Models</title>
        <link>https://blog.otoro.net/2018/03/27/world-models/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2018/03/27/world-models/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20180327/world_card_small.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Can agents learn inside of their own dreams?&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://worldmodels.github.io/&quot;&gt;worldmodels.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://worldmodels.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Tue, 27 Mar 2018 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Evolving Stable Strategies</title>
        <link>https://blog.otoro.net/2017/11/12/evolving-stable-strategies/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/11/12/evolving-stable-strategies/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_normal.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Going for a ride.&lt;/i&gt;&lt;br /&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;previous article&lt;/a&gt;, I described a few evolution strategies (ES) algorithms that can optimise the parameters of a function without needing to explicitly calculate gradients. These algorithms can be applied to reinforcement learning (RL) problems to help find a suitable set of model parameters for a neural network agent. In this article, I will explore applying ES to some of these RL problems, and also highlight methods we can use to find policies that are more stable and robust.&lt;/p&gt;

&lt;h2 id=&quot;evolution-strategies-for-reinforcement-learning&quot;&gt;Evolution Strategies for Reinforcement Learning&lt;/h2&gt;

&lt;p&gt;While RL algorithms require a reward signal to be given to the agent at every timestep, ES algorithms only care about the final cumulative reward that an agent gets at the end of its rollout in an environment. In many problems we only know the outcome at the end of the task, such as whether the agent wins or loses, whether the robot arm picks up the object or not, or whether the agent has survived, and these are the problems where ES may have an advantage over traditional RL. Below is pseudocode that encapsulates a rollout of an agent in an &lt;a href=&quot;https://gym.openai.com/docs/&quot;&gt;OpenAI Gym&lt;/a&gt; environment, where we only care about the cumulative reward:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;rollout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;obs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;done&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;total_reward&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_action&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;obs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;obs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;total_reward&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total_reward&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We can define &lt;code class=&quot;highlighter-rouge&quot;&gt;rollout&lt;/code&gt; to be the objective function that maps the model parameters of an agent into its fitness score, and use an ES solver to find a suitable set of model parameters as described in the previous &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;article&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gym&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'worlddomination-v0'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# use our favourite ES&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EvolutionStrategy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# ask the ES to give set of params&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;solutions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# create array to hold the results&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;fitlist&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;popsize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# evaluate for each given solution&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;popsize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# init the agent with a solution&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solutions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# rollout env with this agent&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fitlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rollout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# give scores results back to ES&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fitlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# get best param &amp;amp; fitness from ES&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;bestsol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bestfit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# see if our task is solved&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bestfit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MY_REQUIREMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;deterministic-and-stochastic-policies&quot;&gt;Deterministic and Stochastic Policies&lt;/h2&gt;

&lt;p&gt;Our agent takes the observation given to it by the environment as an input, and outputs an action at each timestep during a rollout inside the environment. We can model the agent however we want, using methods ranging from hard-coded rules and decision trees to linear functions and recurrent neural networks. In this article I use a simple feed-forward network with 2 hidden layers to map from an agent’s observation, a vector &lt;script type=&quot;math/tex&quot;&gt;x&lt;/script&gt;, directly to the actions, a vector &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;h_1 = f_h(W_1 \; x + b_1)&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;h_2 = f_h(W_2 \; h_1 + b_2)&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;y = f_{out}(W_{out} \; h_2 + b_{out})&lt;/script&gt;

&lt;p&gt;The activation functions &lt;script type=&quot;math/tex&quot;&gt;f_h&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;f_{out}&lt;/script&gt; can be &lt;code class=&quot;highlighter-rouge&quot;&gt;tanh&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;sigmoid&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt;, or whatever we want to use. In all of my experiments I use &lt;code class=&quot;highlighter-rouge&quot;&gt;tanh&lt;/code&gt;. For the output layer, sometimes we may want &lt;script type=&quot;math/tex&quot;&gt;f_{out}&lt;/script&gt; to be a pass-through function without nonlinearities. If we concatenate all the weight and bias parameters into a single vector called &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt;, we see that the above neural network is a deterministic function &lt;script type=&quot;math/tex&quot;&gt;y = F(x, W)&lt;/script&gt;. We can then use ES to find a solution &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; using the search loop described earlier.&lt;/p&gt;
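&lt;p&gt;As a concrete sketch (not the exact code used in my experiments, and with placeholder layer sizes), the two-hidden-layer policy above takes only a few lines of numpy:&lt;/p&gt;

```python
import numpy as np

def make_params(sizes, seed=0):
    # sizes, e.g. [24, 64, 32, 4]: observation dim, two hidden layers, action dim
    rng = np.random.RandomState(seed)
    return [(rng.randn(n_out, n_in) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def policy(x, params):
    # h1 = tanh(W1 x + b1), h2 = tanh(W2 h1 + b2), y = tanh(W_out h2 + b_out)
    h = np.asarray(x, dtype=float)
    for W, b in params:
        h = np.tanh(W @ h + b)
    return h

params = make_params([24, 64, 32, 4])  # flattenable into the single vector W that ES searches over
y = policy(np.zeros(24), params)       # 4 action values, each squashed into (-1, 1)
```

&lt;p&gt;Concatenating every weight matrix and bias in &lt;code class=&quot;highlighter-rouge&quot;&gt;params&lt;/code&gt; into one flat vector gives the &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; that the ES solver proposes candidates for.&lt;/p&gt;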

&lt;p&gt;But what if we don’t want our agent’s policy to be deterministic? For certain tasks, even as simple as rock-paper-scissors, the optimal policy is a random action, so we want our agent to be able to learn a stochastic policy. One way to convert &lt;script type=&quot;math/tex&quot;&gt;y=F(x, W)&lt;/script&gt; into a stochastic policy is to make &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; random. Each model parameter &lt;script type=&quot;math/tex&quot;&gt;w_i \in W&lt;/script&gt; can be a random value drawn from a normal distribution &lt;script type=&quot;math/tex&quot;&gt;N(\mu_i, \sigma_i)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;This type of stochastic network is called a &lt;em&gt;Bayesian Neural Network&lt;/em&gt;. A &lt;a href=&quot;http://edwardlib.org/tutorials/bayesian-neural-network&quot;&gt;Bayesian neural network&lt;/a&gt; is a neural network with a prior distribution on its weights. In this case, the model parameters we want to solve for are the set of &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; vectors, rather than the weights &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt;. During each forward pass of the network, a new &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; is drawn from &lt;script type=&quot;math/tex&quot;&gt;N(\mu, \sigma I)&lt;/script&gt;. There are many &lt;a href=&quot;https://arxiv.org/abs/1703.02910&quot;&gt;interesting&lt;/a&gt; &lt;a href=&quot;https://github.com/andrewgordonwilson/bayesgan/blob/master/README.md&quot;&gt;works&lt;/a&gt; in the literature applying Bayesian networks to many problems, and also &lt;a href=&quot;http://bayesiandeeplearning.org/&quot;&gt;addressing&lt;/a&gt; many challenges of &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.704.7138&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;training&lt;/a&gt; these networks. ES can also be used to directly find solutions for a stochastic policy by setting the solution space to be the &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; vectors, rather than &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt;.&lt;/p&gt;
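&lt;p&gt;A minimal sketch of this idea (the names and sizes below are illustrative, not from the actual experiments): the ES solver proposes a &lt;code class=&quot;highlighter-rouge&quot;&gt;(mu, log_sigma)&lt;/code&gt; pair, and each forward pass draws a fresh weight vector from the corresponding normal distribution:&lt;/p&gt;

```python
import numpy as np

def sample_weights(mu, log_sigma, rng):
    # draw one concrete weight vector W ~ N(mu, diag(sigma^2));
    # ES searches over (mu, log_sigma) instead of over W directly
    sigma = np.exp(log_sigma)
    return mu + sigma * rng.randn(len(mu))

rng = np.random.RandomState(0)
n_params = 10
mu = np.zeros(n_params)              # means of the weight distribution
log_sigma = np.full(n_params, -2.0)  # so sigma = exp(-2), about 0.135
W = sample_weights(mu, log_sigma, rng)  # redrawn on every forward pass
```

&lt;p&gt;Parameterising the standard deviations in log space keeps them positive without any constraint handling, which is convenient when ES perturbs them freely.&lt;/p&gt;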

&lt;p&gt;Stochastic policy networks are also popular in the RL literature. For example, in the &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;Proximal Policy Optimization (PPO)&lt;/a&gt; algorithm, the final layer is a set of &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; parameters and the action is sampled from &lt;script type=&quot;math/tex&quot;&gt;N(\mu, \sigma I)&lt;/script&gt;. Adding &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;noise&lt;/a&gt; to parameters is also known to encourage the agent to explore the environment and escape from local optima. I find that for many tasks where we need an agent to explore, we do not need the entire &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; to be random – just the bias is enough. For challenging locomotion tasks, such as the ones in the &lt;a href=&quot;https://blog.openai.com/roboschool/&quot;&gt;roboschool&lt;/a&gt; environment, I often need to use ES to find a stochastic policy where only the bias parameters are drawn from a normal distribution.&lt;/p&gt;

&lt;h2 id=&quot;evolving-robust-policies-for-bipedal-walker&quot;&gt;Evolving Robust Policies for Bipedal Walker&lt;/h2&gt;


&lt;p&gt;One of the areas where I found ES useful is for searching for robust policies. I want to control the tradeoff between data efficiency, and how robust the policy is over several random trials. To demonstrate this, I tested ES on a nice environment called &lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt; created by &lt;a href=&quot;https://twitter.com/robo_skills&quot;&gt;Oleg Klimov&lt;/a&gt; using the &lt;a href=&quot;https://github.com/pybox2d/pybox2d/blob/master/README.md&quot;&gt;Box2D Physics Engine&lt;/a&gt;, the same physics engine used in &lt;a href=&quot;https://github.com/estevaofon/angry-birds-python/blob/master/README.md&quot;&gt;Angry Birds&lt;/a&gt;.&lt;/p&gt;

&lt;!--&lt;center&gt;
&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;
&lt;i&gt;Our agent solved &lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt;.&lt;/i&gt;&lt;br/&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;--&gt;
&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Evolution Strategy Variant + OpenAI Gym &lt;a href=&quot;https://t.co/t2R0QQ5qcH&quot;&gt;pic.twitter.com/t2R0QQ5qcH&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/889215446150291458?ref_src=twsrc%5Etfw&quot;&gt;July 23, 2017&lt;/a&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;i&gt;Our agent solved &lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt;.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In this environment our agent has to learn a policy to walk across randomly generated terrain within the time limit without falling over. There are 24 inputs, consisting of 10 lidar sensor readings along with various angles and contact indicators. The agent is not given the absolute coordinates of where it is on the map. The action space is 4 continuous values controlling the torques of its 4 motors. The total reward is based on the total distance achieved by the agent. Generally, if the agent completes a map, it will get a score of 300+ points, although a small number of points will be subtracted based on how much motor torque was applied, so energy usage is also a constraint.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt; defines &lt;em&gt;solving&lt;/em&gt; the task as getting an average score of 300+ over 100 consecutive random trials. While it is relatively easy to train an agent to successfully walk across the map using an RL algorithm, it is difficult to get the agent to do so consistently and efficiently, making this task an interesting challenge. To my knowledge, my agent is the only solution known to solve this task so far (as of October 2017).&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;img id=&quot;learning_to_fall_img&quot; src=&quot;/assets/20171109/jpeg/learning_to_fall_img.jpeg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Early stages. Learning to walk.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;&lt;img id=&quot;learning_local_optima_img&quot; src=&quot;/assets/20171109/jpeg/learning_local_optima_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Learns to correct errors, but still slow ...&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Because the terrain map is randomly generated for each trial, sometimes we may end up with an easy terrain, and sometimes a very difficult one. We don’t want our natural selection process to advance agents with weak policies who simply got lucky with an easy map to the next generation, and we want to give agents with good policies a chance to redeem themselves. So what I ended up doing is defining an agent’s episode as 16 random rollouts, and using the average of the cumulative rewards over those 16 rollouts as its fitness score.&lt;/p&gt;

&lt;p&gt;Another way to look at this is to see that even though we are testing the agent over 100 trials, we usually train it on single trials, so the test-task is not the same as the training-task we are optimising for. By averaging each agent in the population multiple times in a stochastic environment, we narrow the gap between our training set and the test set. If we can overfit to the training set, we might as well overfit to the test set, since that’s an &lt;a href=&quot;https://twitter.com/jacobandreas/status/924356906344267776&quot;&gt;okay&lt;/a&gt; thing to do in RL :)&lt;/p&gt;
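&lt;p&gt;Plugged into the earlier search loop, this change is one line: evaluate each candidate with the mean of several rollouts instead of a single one. A sketch (with a toy stand-in for &lt;code class=&quot;highlighter-rouge&quot;&gt;rollout&lt;/code&gt; so the snippet runs on its own):&lt;/p&gt;

```python
import numpy as np

def averaged_fitness(agent, env, rollout_fn, n_trials=16):
    # evaluate one candidate on n_trials randomly generated maps and
    # use the mean cumulative reward as its fitness score
    return np.mean([rollout_fn(agent, env) for _ in range(n_trials)])

# toy stand-in: a noisy score around 300 (the real rollout is defined earlier)
rng = np.random.RandomState(0)
def fake_rollout(agent, env):
    return 300.0 + 50.0 * rng.randn()

fit = averaged_fitness(None, None, fake_rollout)
```

&lt;p&gt;Averaging over 16 trials shrinks the standard deviation of the fitness estimate by a factor of 4, so lucky-map outliers have much less influence on selection.&lt;/p&gt;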

&lt;p&gt;Of course, the data efficiency of our algorithm is now 16x worse, but the final policy is a lot more robust. When I tested the final policy over 100 consecutive random trials, it achieved the average score of over 300 points required to solve this environment. Without this averaging method, the best agent could only obtain an average score of &lt;script type=&quot;math/tex&quot;&gt;\sim&lt;/script&gt; 220 to 230 over 100 trials. To my knowledge, this is the first solution that solves this environment (as of October 2017).&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
&lt;img id=&quot;biped_pepg_final_01_img&quot; src=&quot;/assets/20171109/jpeg/biped_pepg_final_01_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
&lt;img id=&quot;biped_pepg_final_02_img&quot; src=&quot;/assets/20171109/jpeg/biped_pepg_final_02_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;i&gt;Winning solutions evolved using &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;PEPG&lt;/a&gt; using average-of-16 runs per episode.&lt;/i&gt;
&lt;p&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;I also used &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;PPO&lt;/a&gt;, a state-of-the-art policy gradient algorithm for RL, and tried to tune it to the best of my ability to perform well on this task. In the end, I was only able to get PPO to achieve average scores of &lt;script type=&quot;math/tex&quot;&gt;\sim&lt;/script&gt; 240 to 250 over 100 random trials. But I’m sure someone else will be able to use PPO or another RL algorithm to solve this environment in the future. (Please let me know if you do so!)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update (Jan 2018): &lt;a href=&quot;https://github.com/dgriff777&quot;&gt;dgriff777&lt;/a&gt; was able to use a continuous version of A3C+LSTM with 4 stack frames as the input to train BipedalWalkerHardcore-v2 to obtain a score of 300 over 100 random trials. He provided this awesome implementation of his pytorch model on &lt;a href=&quot;https://github.com/dgriff777/a3c_continuous/blob/master/README.md&quot;&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;The ability to control the tradeoff between data efficiency and policy robustness is quite powerful, and useful in the real world where we need safe policies. In theory, with enough compute, we could even have averaged over the required 100 rollouts and optimised our Bipedal Walker directly to the requirements. Professional engineers are often required to have their designs satisfy specific quality assurance guarantees and meet certain safety factors. We need to be able to take such safety factors into account when we train agents to learn policies that may affect the real world.&lt;/p&gt;

&lt;p&gt;Here are a few other solutions that ES discovered:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;&lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;CMA-ES&lt;/a&gt; solution&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
&lt;img id=&quot;biped_oes_img&quot; src=&quot;/assets/20171109/jpeg/biped_oes_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;&lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;OpenAI-ES&lt;/a&gt; solution&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;I also trained the agent with a stochastic policy network initialised with high noise parameters, so the agent sees noise everywhere, and even its actions are noisy. The agent still learned the task despite not being confident that its inputs and outputs were accurate (though this agent couldn’t get a score of 300+):&lt;/p&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/bipedstoc/biped_noisy.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;biped_noisy_img&quot; src=&quot;/assets/20171109/jpeg/biped_noisy_img.jpeg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;!--&lt;video id=&quot;biped_noisy_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/bipedstoc/biped_noisy.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;&lt;br/&gt;--&gt;
&lt;i&gt;Bipedal walker using a stochastic policy.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;h3 id=&quot;kuka-robot-arm-grasping&quot;&gt;Kuka Robot Arm Grasping&lt;/h3&gt;

&lt;p&gt;I also tried to apply ES with this averaging technique on a simplified Kuka robot arm grasping task. This environment is available in the &lt;a href=&quot;https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/gym/pybullet_envs/bullet&quot;&gt;pybullet environment&lt;/a&gt;. The Kuka model used in the simulation is designed to be similar to a real &lt;a href=&quot;https://www.kuka.com/en-de/products&quot;&gt;Kuka&lt;/a&gt; robot arm. In this simplified task, the agent is given the &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/kukaGymEnv.py#L106&quot;&gt;coordinates&lt;/a&gt; of the object.&lt;/p&gt;

&lt;p&gt;More advanced RL environments may require the agent to infer an action directly from pixel inputs, but we could in principle combine this simplified model with a pre-trained convnet that gives us an estimate of the coordinates as well.&lt;/p&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/kuka/kuka.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;kuka_img&quot; src=&quot;/assets/20171109/jpeg/kuka_img.jpeg&quot; width=&quot;100%&quot; /&gt;
&lt;!--&lt;video id=&quot;kuka_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/kuka/kuka.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
&lt;br /&gt;
&lt;i&gt;Robot arm grasping task using a stochastic policy.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;The agent obtains a score of 10000 if it successfully picks up the object, and 0 otherwise. Some points are deducted for energy usage. By averaging this sparse reward over 16 random trials, we can get ES to optimise for robustness. However, in the end, both the deterministic and stochastic policies I obtained could only pick up the object &lt;script type=&quot;math/tex&quot;&gt;\sim&lt;/script&gt; 70 to 75% of the time. There is still room for improvement.&lt;/p&gt;

&lt;h2 id=&quot;getting-a-minitaur-to-learn-a-multiple-tasks&quot;&gt;Getting a Minitaur to Learn Multiple Tasks&lt;/h2&gt;

&lt;p&gt;Learning to perform multiple difficult tasks at the same time makes us better at performing individual tasks. For example, Shaolin monks who lift weights while standing on a pole will be able to balance better without the weights. Learning not to spill a cup of water while cruising a car at 80mph in the mountains will make the driver a better illegal street racer. We can also train agents on multiple tasks at once to make them learn more stable policies.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/shaolin.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;shaolin_img&quot; src=&quot;/assets/20171109/jpeg/shaolin_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;shaolin_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/shaolin.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Shaolin Agents.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img id=&quot;learning_to_drift_img&quot; src=&quot;/assets/20171109/learning_to_drift.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;learning_to_drift_img&quot; src=&quot;/assets/20171109/jpeg/learning_to_drift_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Learning to drift.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Recent work on &lt;a href=&quot;https://arxiv.org/abs/1710.03748&quot;&gt;self-play&lt;/a&gt; agents demonstrated that agents who learn difficult tasks such as Sumo wrestling (a sport that requires many skills) can also perform easier tasks, like withstanding wind while walking, without any further training. &lt;a href=&quot;https://twitter.com/erwincoumans/status/924352109511819264&quot;&gt;Erwin Coumans&lt;/a&gt; recently experimented with placing a &lt;a href=&quot;https://twitter.com/erwincoumans/status/924352109511819264&quot;&gt;duck&lt;/a&gt; on top of a Minitaur learning to walk forward. If the duck fell off, the Minitaur would also fail the task, so the hope is that this type of task augmentation will help transfer learned policies from simulation over to the real Minitaur. I took one of his &lt;a href=&quot;https://gist.github.com/erwincoumans/c579e076cbaf7c76caa9a42829408e2e&quot;&gt;examples&lt;/a&gt; and experimented with training the Minitaur-and-duck combination using ES.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/minitaur_faster.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;minitaur_faster_img&quot; src=&quot;/assets/20171109/jpeg/minitaur_faster_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;minitaur_faster_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/minitaur_faster.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;CMA-ES walking policy in &lt;a href=&quot;https://pybullet.org&quot;&gt;pybullet&lt;/a&gt;.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/real_minitaur.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;real_minitaur_img&quot; src=&quot;/assets/20171109/jpeg/real_minitaur_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;real_minitaur_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/real_minitaur.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Real Minitaur from &lt;a href=&quot;https://www.ghostrobotics.io/&quot;&gt;Ghost Robotics.&lt;/a&gt;&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The Minitaur model in &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/minitaur.py&quot;&gt;pybullet&lt;/a&gt; is designed to mimic the real physical Minitaur. However, a policy trained in a perfect simulation environment usually fails in the real world. It may not even generalise to small augmentations of the task inside the simulation. For example, the figure above shows a Minitaur trained (using CMA-ES) to walk forward, but this policy is not always able to carry a duck across the room when we place one on its back inside the simulation.&lt;/p&gt;
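&lt;p&gt;One way to frame this kind of task augmentation is as a wrapper around an existing environment that adds a failure condition for the payload. The sketch below uses a toy environment and made-up names, not the actual pybullet Minitaur classes:&lt;/p&gt;

```python
import random

class ToyWalker:
    """Toy stand-in for a walking environment (not the pybullet Minitaur)."""
    def reset(self):
        self.x = 0.0
        return self.x
    def step(self, action):
        self.x += action               # move forward
        reward = action                # reward forward progress
        done = self.x >= 10.0
        return self.x, reward, done, {}

class PayloadWrapper:
    """Task augmentation: the episode also fails if the payload
    (a duck, or a ball) falls off the walker's back."""
    def __init__(self, env, fall_prob=0.05, seed=0):
        self.env = env
        self.fall_prob = fall_prob
        self.rng = random.Random(seed)
    def reset(self):
        return self.env.reset()
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Aggressive motions make the payload more likely to fall off.
        if self.rng.random() < self.fall_prob * abs(action):
            reward, done = 0.0, True   # payload fell: episode fails
            info["payload_fell"] = True
        return obs, reward, done, info

env = PayloadWrapper(ToyWalker())
obs, total, done = env.reset(), 0.0, False
while not done:
    obs, r, done, info = env.step(1.0)
    total += r
```

Training against the wrapped environment forces the policy to trade off speed against the risk of dropping the payload, which is exactly the pressure that encourages more stable gaits.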

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;img id=&quot;duck_notrain_img&quot; src=&quot;/assets/20171109/jpeg/duck_notrain_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_notrain.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
  &lt;!--&lt;video id=&quot;duck_notrain_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_notrain.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Walking policy works with duck.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
  &lt;video autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_normal.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Policy trained on duck.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The policy learned from the pure walking task still works to some degree even with the duck on board, meaning that the addition of the duck didn’t make the task much harder. The duck has a flat, stable bottom, so keeping it from falling off the Minitaur’s back wasn’t too difficult. I replaced the duck with a ball to make the task much harder.&lt;/p&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/ball_cheating.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;ball_cheating_img&quot; src=&quot;/assets/20171109/jpeg/ball_cheating_img.jpeg&quot; width=&quot;100%&quot; /&gt;
&lt;!--&lt;video id=&quot;ball_cheating_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/ball_cheating.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
&lt;br /&gt;
&lt;i&gt;Learning to cheat.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;However, replacing the duck with a ball didn’t immediately result in a stable balancing policy. Instead, CMA-ES found a policy that still technically carried the ball across the floor: the ball first slides into a hole made for the Minitaur’s legs, and the Minitaur then carries the ball inside this hole. The lesson here is that an objective-driven search algorithm will take advantage of any design flaws in the environment and exploit them to reach its objective.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/ball_stoc.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;ball_stoc_img&quot; src=&quot;/assets/20171109/jpeg/ball_stoc_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;ball_stoc_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/ball_stoc.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Stochastic policy trained with ball.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_stoc.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;duck_stoc_img&quot; src=&quot;/assets/20171109/jpeg/duck_stoc_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;duck_stoc_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_stoc.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Same policy with duck.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;After making the ball smaller, CMA-ES was able to find a stochastic policy that can walk and balance the ball at the same time. This policy also transferred back to the easier duck task. In the future, I hope these types of task augmentation techniques will prove useful for transferring learned policies to real robots.&lt;/p&gt;

&lt;h2 id=&quot;estool&quot;&gt;ESTool&lt;/h2&gt;

&lt;p&gt;One of the big selling points of ES is that it is easy to parallelise the computation across several workers running on different threads, different CPU cores, or even &lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;different machines&lt;/a&gt;. Python’s &lt;a href=&quot;https://docs.python.org/2/library/multiprocessing.html&quot;&gt;multiprocessing&lt;/a&gt; makes it simple to launch parallel processes, but I prefer to use the Message Passing Interface (MPI) with &lt;a href=&quot;https://mpi4py.scipy.org/docs/&quot;&gt;mpi4py&lt;/a&gt; to launch a separate Python process for each job. This gets around the &lt;a href=&quot;https://en.wikipedia.org/wiki/Global_interpreter_lock&quot;&gt;global interpreter lock&lt;/a&gt;, and also gives me confidence that each process has its own sandboxed numpy and gym instance, which is important when it comes to seeding random number generators.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/robo/roboschool.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;roboschool_img&quot; src=&quot;/assets/20171109/jpeg/roboschool_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;roboschool_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/roboschool.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Roboschool Hopper, Walker, Ant.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
&lt;img id=&quot;reacher_img&quot; src=&quot;/assets/20171109/jpeg/reacher_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;reacher_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/reacher.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Roboschool Reacher.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;i&gt;Agents evolved using &lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;&lt;code&gt;estool&lt;/code&gt;&lt;/a&gt; on various &lt;a href=&quot;https://blog.openai.com/roboschool/&quot;&gt;roboschool&lt;/a&gt; tasks.&lt;/i&gt;
&lt;p&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;I have implemented a simple tool called &lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt;&lt;/a&gt; that uses the &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/es.py&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt;&lt;/a&gt; library described in the previous &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;article&lt;/a&gt; to train simple feed-forward policy networks on continuous control RL tasks written with a gym interface. I used &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt; to train all of the experiments described earlier, as well as various other continuous control tasks in gym and roboschool. &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt; uses MPI for distributed processing, so it shouldn’t require too much work to distribute workers over multiple machines.&lt;/p&gt;
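&lt;p&gt;The policies involved can be tiny. As a hedged sketch (not the actual &lt;code&gt;estool&lt;/code&gt; model code), a single-hidden-layer feed-forward policy maps an observation vector to continuous actions, with all weights and biases flattened into one parameter vector, which is exactly the object that ES evolves:&lt;/p&gt;

```python
import numpy as np

def make_policy(obs_dim, hidden_dim, act_dim):
    """Return the parameter count and a function mapping a flat parameter
    vector plus an observation to a continuous action in [-1, 1]."""
    n_params = (obs_dim + 1) * hidden_dim + (hidden_dim + 1) * act_dim
    def policy(params, obs):
        # Slice the flat vector back into weight matrices and bias vectors.
        i = obs_dim * hidden_dim
        w1 = params[:i].reshape(obs_dim, hidden_dim)
        b1 = params[i:i + hidden_dim]
        j = i + hidden_dim
        k = j + hidden_dim * act_dim
        w2 = params[j:k].reshape(hidden_dim, act_dim)
        b2 = params[k:k + act_dim]
        h = np.tanh(obs @ w1 + b1)        # hidden layer
        return np.tanh(h @ w2 + b2)       # actions squashed to [-1, 1]
    return n_params, policy

n_params, policy = make_policy(obs_dim=4, hidden_dim=8, act_dim=2)
action = policy(np.zeros(n_params), np.ones(4))
```

Since ES only ever sees the flat parameter vector, the same optimiser code works unchanged for any network shape.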


&lt;h2 id=&quot;estool-with-pybullet&quot;&gt;ESTool with pybullet&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;GitHub repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the environments that come with gym and roboschool, &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt; works well with most &lt;a href=&quot;https://pybullet.org&quot;&gt;pybullet&lt;/a&gt; gym environments. It is also easy to build custom pybullet environments by modifying existing environments. For example, I was able to make the Minitaur with ball environment (in the &lt;code class=&quot;highlighter-rouge&quot;&gt;custom_envs&lt;/code&gt; directory of the repo) without much effort, and being able to tinker with the environment makes it easier to try out new ideas. If you want to incorporate 3D models from other software packages like &lt;a href=&quot;http://gazebosim.org/tutorials/?tut=ros_urdf&quot;&gt;ROS&lt;/a&gt; or &lt;a href=&quot;https://www.blender-models.com/model-downloads/mechanicalelectronical/robotics/id/star-wars-pit-droid/&quot;&gt;Blender&lt;/a&gt;, you can try building new and interesting pybullet environments and challenge others to try to solve them.&lt;/p&gt;

&lt;p&gt;Many models and environments in pybullet, such as the Kuka robot arm and the Minitaur, are modelled to be similar to the real robot as part of current exciting transfer learning research efforts. In fact, many of these recent &lt;a href=&quot;https://stanfordvl.github.io/ntp/&quot;&gt;cutting&lt;/a&gt; &lt;a href=&quot;https://sites.google.com/view/multi-task-domain-adaptation&quot;&gt;edge&lt;/a&gt; &lt;a href=&quot;https://sermanet.github.io/imitate/&quot;&gt;research&lt;/a&gt; &lt;a href=&quot;https://research.googleblog.com/2017/10/closing-simulation-to-reality-gap-for.html&quot;&gt;papers&lt;/a&gt; are using pybullet to conduct transfer learning experiments.&lt;/p&gt;

&lt;p&gt;You don’t need an expensive Minitaur or Kuka robot arm to play with sim-to-real experiments though. There is a &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/racecar.py&quot;&gt;racecar&lt;/a&gt; model inside pybullet that is modelled after the &lt;a href=&quot;https://mit-racecar.github.io/&quot;&gt;MIT racecar&lt;/a&gt; open source hardware kit. There’s even a pybullet environment that mounts a &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/racecarZEDGymEnv.py&quot;&gt;virtual camera&lt;/a&gt; onto the virtual racecar to give the agent a virtual pixel screen as an input observation.&lt;/p&gt;

&lt;p&gt;Let’s try the easier version first, where the racecar simply needs to learn a policy to move towards a giant ball. In the &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/racecarGymEnv.py&quot;&gt;RacecarBulletEnv-v0&lt;/a&gt; environment, the agent receives the relative coordinates of the ball as its input, and outputs continuous actions that control the motor speed and steering direction. The task is simple enough that it takes only 5 minutes (50 generations) to train on a 2014 Macbook Pro (with an 8-core CPU). Using &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt;, the command below launches the training job on eight processes, assigning each process 4 jobs for a total of 32 workers, and uses CMA-ES to evolve the policies:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python train.py bullet_racecar -o cma -n 8 -t 4
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The training progress, as well as the model parameters found, will be stored in the &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt; subdirectory. We can run this command to visualise an agent inside the environment using the best policy found:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python model.py bullet_racecar log/bullet_racecar.cma.1.32.best.json
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/robo/simple_racecar.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;simple_racecar_img&quot; src=&quot;/assets/20171109/jpeg/simple_racecar_img.jpeg&quot; width=&quot;100%&quot; /&gt;
&lt;!--&lt;video id=&quot;simple_racecar_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/simple_racecar.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
&lt;br /&gt;
&lt;i&gt;pybullet racecar environment, based on the &lt;a href=&quot;https://mit-racecar.github.io/&quot;&gt;MIT Racecar&lt;/a&gt;.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;In the simulation, we can use the mouse cursor to move the ball around, and even move the racecar around if we want to interact with it.&lt;/p&gt;

&lt;p&gt;The IPython notebook &lt;code class=&quot;highlighter-rouge&quot;&gt;plot_training_progress.ipynb&lt;/code&gt; can visualise the per-generation training history of the racecar agents. For each generation, we can see the best score, the worst score, and the average score across the entire population.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171109/svg/bullet_racecar.wallclock.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/20171109/svg/bullet_racecar.generation.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
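&lt;p&gt;The per-generation statistics shown in these plots are straightforward to compute from the population’s episode scores. A small sketch (the notebook’s actual implementation may differ):&lt;/p&gt;

```python
import numpy as np

def generation_stats(scores):
    """Summarise one generation's fitness scores across the population."""
    scores = np.asarray(scores, dtype=float)
    return {
        "best": float(scores.max()),
        "worst": float(scores.min()),
        "mean": float(scores.mean()),
    }

# e.g. the scores of a population of 32 agents in one generation
stats = generation_stats(np.random.default_rng(0).uniform(0, 100, size=32))
```

Tracking the worst and average scores alongside the best one is useful for ES, since a rising mean indicates the whole population is improving rather than a single lucky candidate.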

&lt;p&gt;Standard locomotion tasks similar to those in roboschool, such as Inverted Pendulum, Hopper, Walker, HalfCheetah, Ant, and Humanoid, are also available in pybullet. Using PEPG with a population size of 256, I found a policy for pybullet’s Ant that reaches a score of 3000 within hours on a multi-core machine:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python train.py bullet_ant -o pepg -n 64 -t 4
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;center&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/robo/bullet_ant_demo.gif&quot; width=&quot;80%&quot;/&gt;--&gt;
&lt;img id=&quot;bullet_ant_demo_img&quot; src=&quot;/assets/20171109/jpeg/bullet_ant_demo_img.jpeg&quot; width=&quot;80%&quot; /&gt;
  &lt;!--&lt;video id=&quot;bullet_ant_demo_video&quot; autoplay muted playsinline loop width=&quot;80%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/bullet_ant_demo.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/robo/bullet_ant.gif&quot; width=&quot;120%&quot;/&gt;--&gt;
&lt;img id=&quot;bullet_ant_img&quot; src=&quot;/assets/20171109/jpeg/bullet_ant_img.jpeg&quot; width=&quot;120%&quot; /&gt;
  &lt;!--&lt;video id=&quot;bullet_ant_video&quot; autoplay muted playsinline loop width=&quot;120%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/bullet_ant.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;i&gt;Example rollout of &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_locomotion_envs.py&quot;&gt;AntBulletEnv&lt;/a&gt;. We can still save rollouts as an .mp4 video using &lt;code&gt;gym.wrappers.Monitor&lt;/code&gt;.&lt;/i&gt;
&lt;/center&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171109/svg/bullet_ant.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;In this article, I discussed using ES to find policies for a feed-forward neural network agent performing various continuous control RL tasks defined by a gym environment interface. I described &lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt;&lt;/a&gt;, which allowed me to quickly try different ES algorithms with various settings in a distributed processing environment using the MPI framework.&lt;/p&gt;

&lt;p&gt;So far, I have only discussed methods for training an agent by having it learn a policy from trial-and-error in the environment. This form of training from scratch is referred to as &lt;em&gt;model-free&lt;/em&gt; reinforcement learning. In the next article (&lt;em&gt;if I ever get to writing it&lt;/em&gt;), I will discuss &lt;em&gt;model-based&lt;/em&gt; learning, where our agent learns to exploit a previously learned model of the environment to accomplish a given task. And yes, I will still be using evolution.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{ha2017evolving,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;Evolving Stable Strategies&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author&amp;nbsp;&amp;nbsp;= &quot;Ha, David&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal&amp;nbsp;= &quot;blog.otoro.net&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;2017&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;url&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;https://blog.otoro.net/2017/11/12/evolving-stable-strategies/&quot;&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;I want to thank &lt;a href=&quot;https://twitter.com/erwincoumans&quot;&gt;Erwin Coumans&lt;/a&gt; for writing all these great environments, and also for helping me work on making 
&lt;a href=&quot;https://github.com/hardmaru/estool&quot;&gt;ESTool&lt;/a&gt; better. Great research cannot be done without great tools.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;In the end, it all comes down to choices to turn stumbling blocks into stepping stones.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&quot;interesting-links&quot;&gt;Interesting Links&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=35VE9WykH1c&quot;&gt;“Fires of a Revolution” Incredible Fast Piano Music (EPIC)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;A Visual Guide to Evolution Strategies&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/hardmaru/estool&quot;&gt;ESTool&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://tuvalu.santafe.edu/~erica/stable.pdf&quot;&gt;Stable or Robust? What’s the Difference?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gym.openai.com/docs/&quot;&gt;OpenAI Gym Docs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;Evolution Strategies as a Scalable Alternative to Reinforcement Learning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://edwardlib.org/&quot;&gt;Edward, A library for probabilistic modeling, inference, and criticism&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=FD8l2vPU5FY&quot;&gt;History of Bayesian Neural Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.openai.com/roboschool/&quot;&gt;roboschool&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://pybullet.org&quot;&gt;pybullet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1710.03748&quot;&gt;Emergent Complexity via Multi-Agent Competition&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://research.googleblog.com/2017/10/closing-simulation-to-reality-gap-for.html&quot;&gt;GraspGAN&lt;/a&gt;&lt;/p&gt;

&lt;script&gt;
var replace_list=[
[
&quot;learning_to_fall_img&quot;,
&quot;/assets/20171109/biped/learning_to_fall.gif&quot;
],[
&quot;learning_local_optima_img&quot;,
&quot;/assets/20171109/biped/learning_local_optima.gif&quot;
],[
&quot;biped_pepg_final_01_img&quot;,
&quot;/assets/20171109/biped/biped_pepg_final_01.gif&quot;
],[
&quot;biped_pepg_final_02_img&quot;,
&quot;/assets/20171109/biped/biped_pepg_final_02.gif&quot;
],[
&quot;biped_oes_img&quot;,
&quot;/assets/20171109/biped/biped_oes.gif&quot;
],[
&quot;biped_noisy_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/bipedstoc/biped_noisy.gif&quot;
],[
&quot;kuka_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/8a6ccaf5/anim/kuka/kuka.gif&quot;
],[
&quot;shaolin_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/shaolin.gif&quot;
],[
&quot;learning_to_drift_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/learning_to_drift.gif&quot;
],[
&quot;minitaur_faster_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/minitaur_faster.gif&quot;
],[
&quot;real_minitaur_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/real_minitaur.gif&quot;
],[
&quot;duck_notrain_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/duck_notrain.gif&quot;
],[
&quot;ball_cheating_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/ball_cheating.gif&quot;
],[
&quot;ball_stoc_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/ball_stoc.gif&quot;
],[
&quot;duck_stoc_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/8a6ccaf5/anim/minitaur/duck_stoc.gif&quot;
],[
&quot;roboschool_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/8a6ccaf5/anim/robo/roboschool.gif&quot;
],[
&quot;reacher_img&quot;,
&quot;/assets/20171109/robo/reacher.gif&quot;
],[
&quot;simple_racecar_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/robo/simple_racecar.gif&quot;
],[
&quot;bullet_ant_demo_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/robo/bullet_ant_demo.gif&quot;
],[
&quot;bullet_ant_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/robo/bullet_ant.gif&quot;
]];

function replace_jpeg(tagname, newurl, time_delay) {
  setTimeout(function(){
    var img;
    console.log('replacing '+tagname+' with a gif.');
    img = document.getElementById(tagname);
    img.src = newurl;
  }, time_delay*1000);
}

for(var i=0;i&lt;replace_list.length;i++) {
  replace_jpeg(replace_list[i][0], replace_list[i][1], 5+5*i);
}

&lt;/script&gt;

</description>
        <pubDate>Sun, 12 Nov 2017 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>A Visual Guide to Evolution Strategies</title>
        <link>https://blog.otoro.net/2017/10/29/visual-evolution-strategies/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/10/29/visual-evolution-strategies/</guid>
        <description>&lt;center&gt;
&lt;img src=&quot;/assets/20171031/es_bear.jpeg&quot; width=&quot;60%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Survival of the fittest.&lt;/i&gt;
&lt;!--
&lt;p&gt;&lt;/p&gt;
Evolved Bipedal Walker&lt;br/&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/hardmaru/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;
--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;In this post I explain how evolution strategies (ES) work with the aid of a few visual examples. I try to keep the equations light, and I provide links to original articles if the reader wishes to understand more details. This is the first post in a series of articles, where I plan to show how to apply these algorithms to a range of tasks from MNIST, OpenAI Gym, Roboschool to PyBullet environments.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Neural network models are highly expressive and flexible, and if we are able to find a suitable set of model parameters, we can use neural nets to solve many challenging problems. Deep learning’s success largely comes from the ability to use the backpropagation algorithm to efficiently calculate the gradient of an objective function over each model parameter. With these gradients, we can efficiently search over the parameter space to find a solution that is often good enough for our neural net to accomplish difficult tasks.&lt;/p&gt;

&lt;p&gt;However, there are many problems where the backpropagation algorithm cannot be used. For example, in reinforcement learning (RL) problems, we can train a neural network to make decisions to perform a sequence of actions that accomplish some task in an environment. However, it is not trivial to estimate the gradient of future reward signals with respect to an action performed by the agent right now, especially if the reward is realised many timesteps in the future. Even if we are able to calculate accurate gradients, there is also the issue of being stuck in a local optimum, of which there are many in RL tasks.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171031/biped/biped_local_optima.gif&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Stuck in a local optimum.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;A whole area within RL is devoted to studying this credit-assignment problem, and great progress has been made in recent years. However, credit assignment is still difficult when the reward signals are sparse. In the real world, rewards can be sparse and noisy. Sometimes we are given just a single reward, like a bonus check at the end of the year, and depending on our employer, it may be difficult to figure out exactly why it is so low. For these problems, rather than rely on a very noisy and possibly meaningless gradient estimate for updating our policy, we might as well ignore gradient information altogether and attempt to use black-box optimisation techniques such as genetic algorithms (GA) or ES.&lt;/p&gt;

&lt;p&gt;OpenAI published a paper called &lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;Evolution Strategies as a Scalable Alternative to Reinforcement Learning&lt;/a&gt; where they showed that evolution strategies, while being less data efficient than RL, offer many benefits. The ability to abandon gradient calculation allows such algorithms to be evaluated more efficiently. It is also easy to distribute the computation for an ES algorithm to thousands of machines for parallel computation. By running the algorithm from scratch many times, they also showed that policies discovered using ES tend to be more diverse compared to policies discovered by RL algorithms.&lt;/p&gt;

&lt;p&gt;I would like to point out that even the problem of identifying a machine learning model itself, such as designing a neural net’s architecture, is one where we cannot directly compute gradients. While &lt;a href=&quot;https://research.googleblog.com/2017/05/using-machine-learning-to-explore.html&quot;&gt;RL&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1703.00548&quot;&gt;Evolution&lt;/a&gt;, &lt;a href=&quot;https://blog.otoro.net/2016/05/07/backprop-neat/&quot;&gt;GA&lt;/a&gt;, etc., can be applied to search in the space of model architectures, in this post I will focus only on applying these algorithms to search for parameters of a pre-defined model.&lt;/p&gt;

&lt;h2 id=&quot;what-is-an-evolution-strategy&quot;&gt;What is an Evolution Strategy?&lt;/h2&gt;

&lt;center&gt;
&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/8/8b/Rastrigin_function.png&quot; width=&quot;70%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;The two-dimensional Rastrigin function has many local optima. (Source: &lt;a href=&quot;https://en.wikipedia.org/wiki/Test_functions_for_optimization&quot;&gt;Wikipedia&lt;/a&gt;)&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;The diagrams below are top-down plots of &lt;em&gt;shifted&lt;/em&gt; 2D &lt;a href=&quot;https://en.wikipedia.org/wiki/Test_functions_for_optimization&quot;&gt;Schaffer and Rastrigin&lt;/a&gt; functions, two of several simple toy problems used for testing continuous black-box optimisation algorithms. Lighter regions of the plots represent higher values of &lt;script type=&quot;math/tex&quot;&gt;F(x, y)&lt;/script&gt;. As you can see, there are many local optima in these functions. Our job is to find a set of &lt;em&gt;model parameters&lt;/em&gt; &lt;script type=&quot;math/tex&quot;&gt;(x, y)&lt;/script&gt; such that &lt;script type=&quot;math/tex&quot;&gt;F(x, y)&lt;/script&gt; is as close as possible to the global maximum.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;center&gt;&lt;i&gt;Schaffer-2D Function&lt;/i&gt;&lt;/center&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/schaffer_label.png&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;center&gt;&lt;i&gt;Rastrigin-2D Function&lt;/i&gt;&lt;/center&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/rastrigin_label.png&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Although there are many definitions of evolution strategies, we can define an evolution strategy as an algorithm that provides the user with a set of candidate solutions to evaluate against a problem. The evaluation is based on an &lt;em&gt;objective function&lt;/em&gt; that takes a given solution and returns a single &lt;em&gt;fitness&lt;/em&gt; value. Based on the fitness results of the current solutions, the algorithm then produces the next generation of candidate solutions, which is likely to produce even better results than the current generation. The iterative process stops once the best known solution is satisfactory to the user.&lt;/p&gt;

&lt;p&gt;Given an evolution strategy algorithm called &lt;code class=&quot;highlighter-rouge&quot;&gt;EvolutionStrategy&lt;/code&gt;, we can use it in the following way:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

solver = EvolutionStrategy()

while True:

  # ask the ES to give us a set of candidate solutions
  solutions = solver.ask()

  # create an array to hold the fitness results.
  fitness_list = np.zeros(solver.popsize)

  # evaluate the fitness for each given solution.
  for i in range(solver.popsize):
    fitness_list[i] = evaluate(solutions[i])

  # give list of fitness results back to ES
  solver.tell(fitness_list)

  # get best parameter, fitness from ES
  best_solution, best_fitness = solver.result()

  if best_fitness &amp;gt; MY_REQUIRED_FITNESS:
    break
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Although the size of the population is usually held constant for each generation, it doesn’t need to be. The ES can generate as many candidate solutions as we want, because the solutions produced by an ES are &lt;em&gt;sampled&lt;/em&gt; from a distribution whose parameters are updated by the ES at each generation. I will explain this sampling process with an example of a simple evolution strategy.&lt;/p&gt;

&lt;h2 id=&quot;simple-evolution-strategy&quot;&gt;Simple Evolution Strategy&lt;/h2&gt;

&lt;p&gt;One of the simplest evolution strategies we can imagine simply samples a set of solutions from a Normal distribution, with a mean &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and a fixed standard deviation &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;. In our 2D problem, &lt;script type=&quot;math/tex&quot;&gt;\mu = (\mu_x, \mu_y)&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma = (\sigma_x, \sigma_y)&lt;/script&gt;. Initially, &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; is set at the origin. After the fitness results are evaluated, we set &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; to the best solution in the population, and sample the next generation of solutions around this new mean. This is how the algorithm behaves over 20 generations on the two problems mentioned earlier:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/simplees.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/simplees.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;In the visualisation above, the green dot indicates the mean of the distribution at each generation, the blue dots are the sampled solutions, and the red dot is the best solution found so far by our algorithm.&lt;/p&gt;

&lt;p&gt;This simple algorithm will generally only work for simple problems. Given its greedy nature, it throws away all but the best solution, so it is prone to getting stuck at a local optimum on more complicated problems. It would be beneficial to sample the next generation from a probability distribution that represents a more diverse set of ideas, rather than just from the best solution of the current generation.&lt;/p&gt;
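&lt;p&gt;As a minimal sketch of this simple evolution strategy (assuming a fitness function &lt;code&gt;F&lt;/code&gt; that we want to maximise; the function name and default values here are my own, not from any library):&lt;/p&gt;

```python
import numpy as np

def simple_es(F, num_generations=20, popsize=64, sigma=0.1):
    """Greedy ES: sample around the best solution of each generation."""
    mu = np.zeros(2)  # start the mean at the origin
    best_solution, best_fitness = mu.copy(), -np.inf
    for g in range(num_generations):
        # sample a population around the current mean, with fixed sigma
        solutions = mu + sigma * np.random.randn(popsize, 2)
        fitness = np.array([F(s) for s in solutions])
        # greedily move the mean to the best solution of this generation
        mu = solutions[np.argmax(fitness)]
        if fitness.max() > best_fitness:
            best_solution, best_fitness = mu.copy(), fitness.max()
    return best_solution, best_fitness
```

&lt;p&gt;On a smooth unimodal objective this converges quickly; on a function like Rastrigin, the same greediness is exactly what gets it stuck in a local optimum.&lt;/p&gt;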

&lt;h2 id=&quot;simple-genetic-algorithm&quot;&gt;Simple Genetic Algorithm&lt;/h2&gt;

&lt;p&gt;One of the oldest black-box optimisation algorithms is the genetic algorithm. There are many variations with many degrees of sophistication, but I will illustrate the simplest version here.&lt;/p&gt;

&lt;p&gt;The idea is quite simple: keep only the best-performing 10% of solutions in the current generation, and let the rest of the population die. In the next generation, a new solution is sampled by randomly selecting two solutions from the survivors of the previous generation and recombining their parameters. This &lt;em&gt;crossover&lt;/em&gt; recombination process uses a coin toss to determine which parent to take each parameter from. In the case of our 2D toy function, our new solution might inherit &lt;script type=&quot;math/tex&quot;&gt;x&lt;/script&gt; or &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt; from either parent with 50% chance. Gaussian noise with a fixed standard deviation is also injected into each new solution after this recombination process.&lt;/p&gt;
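&lt;p&gt;A hedged sketch of this selection, crossover, and mutation procedure (the function name and parameter defaults are my own):&lt;/p&gt;

```python
import numpy as np

def ga_next_generation(solutions, fitness, elite_frac=0.1, sigma=0.1):
    """Produce the next GA population via elite selection, crossover, mutation."""
    popsize, dim = solutions.shape
    n_elite = max(2, int(popsize * elite_frac))
    # keep only the best-performing solutions; the rest of the population dies
    elite = solutions[np.argsort(fitness)[-n_elite:]]
    children = np.empty_like(solutions)
    for i in range(popsize):
        # pick two random parents from the survivors
        a, b = elite[np.random.choice(n_elite, size=2, replace=False)]
        # coin-toss crossover: inherit each parameter from either parent
        coin = np.random.rand(dim) > 0.5
        children[i] = np.where(coin, a, b)
    # inject gaussian noise with a fixed standard deviation into each child
    return children + sigma * np.random.randn(popsize, dim)
```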

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/simplega.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/simplega.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The figure above illustrates how the simple genetic algorithm works. The green dots represent members of the elite population from the previous generation, the blue dots are the offspring that form the set of candidate solutions, and the red dot is the best solution.&lt;/p&gt;

&lt;p&gt;Genetic algorithms maintain diversity by keeping track of a diverse set of candidate solutions to reproduce the next generation. In practice, however, most of the solutions in the elite surviving population tend to converge to a local optimum over time. There are more sophisticated variations of GA out there, such as &lt;a href=&quot;http://people.idsia.ch/~juergen/gomez08a.pdf&quot;&gt;CoSyNe&lt;/a&gt;, &lt;a href=&quot;https://blog.otoro.net/2015/03/10/esp-algorithm-for-double-pendulum/&quot;&gt;ESP&lt;/a&gt;, and &lt;a href=&quot;https://blog.otoro.net/2016/05/07/backprop-neat/&quot;&gt;NEAT&lt;/a&gt;, where the idea is to cluster similar solutions in the population together into different species, to maintain better diversity over time.&lt;/p&gt;

&lt;h2 id=&quot;covariance-matrix-adaptation-evolution-strategy-cma-es&quot;&gt;Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)&lt;/h2&gt;

&lt;p&gt;A shortcoming of both the Simple ES and Simple GA is that the standard deviation of our noise parameter is fixed. There are times when we want to explore more and increase the standard deviation of our search space, and there are times when we are confident we are close to a good optimum and just want to fine-tune the solution. We basically want our search process to behave like this:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/cmaes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/cmaes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Amazing, isn’t it? The search process shown in the figure above is produced by &lt;a href=&quot;https://en.wikipedia.org/wiki/CMA-ES&quot;&gt;Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)&lt;/a&gt;. CMA-ES is an algorithm that can take the results of each generation, and adaptively increase or decrease the search space for the next generation. It not only adapts the mean &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and sigma &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; parameters, but also calculates the entire covariance matrix of the parameter space. At each generation, CMA-ES provides the parameters of a multi-variate normal distribution to sample solutions from. So how does it know how to increase or decrease the search space?&lt;/p&gt;

&lt;p&gt;Before we discuss its methodology, let’s review how to estimate a &lt;a href=&quot;https://en.wikipedia.org/wiki/Covariance_matrix&quot;&gt;covariance matrix&lt;/a&gt;. This will be important for understanding CMA-ES’s methodology later on. If we want to estimate the covariance matrix of our entire sampled population of size &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt;, we can do so using the set of equations below to calculate the maximum likelihood estimate of a covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C&lt;/script&gt;. We first calculate the means of each of the &lt;script type=&quot;math/tex&quot;&gt;x_i&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;y_i&lt;/script&gt; in our population:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_x = \frac{1}{N} \sum_{i=1}^{N}x_i,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_y = \frac{1}{N} \sum_{i=1}^{N}y_i.&lt;/script&gt;

&lt;p&gt;The terms of the 2x2 covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C&lt;/script&gt; will be:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N}(x_i - \mu_x)^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_y^2 = \frac{1}{N} \sum_{i=1}^{N}(y_i - \mu_y)^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y).&lt;/script&gt;

&lt;p&gt;Of course, these resulting mean estimates &lt;script type=&quot;math/tex&quot;&gt;\mu_x&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\mu_y&lt;/script&gt;, and covariance terms &lt;script type=&quot;math/tex&quot;&gt;\sigma_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_{xy}&lt;/script&gt; are just estimates of the actual covariance matrix that we originally sampled from, and are not, by themselves, particularly useful to us.&lt;/p&gt;
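&lt;p&gt;These maximum likelihood estimates are easy to verify numerically. Note that &lt;code&gt;np.cov&lt;/code&gt; divides by N-1 by default, so &lt;code&gt;bias=True&lt;/code&gt; is needed to match the 1/N formulas above:&lt;/p&gt;

```python
import numpy as np

np.random.seed(0)
x, y = np.random.randn(2, 1000)  # N = 1000 samples of each coordinate

# maximum likelihood estimates, dividing by N as in the formulas above
mu_x, mu_y = x.mean(), y.mean()
var_x = np.mean((x - mu_x) ** 2)
var_y = np.mean((y - mu_y) ** 2)
cov_xy = np.mean((x - mu_x) * (y - mu_y))

C = np.array([[var_x, cov_xy],
              [cov_xy, var_y]])

# bias=True makes np.cov divide by N instead of N - 1
assert np.allclose(C, np.cov(x, y, bias=True))
```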

&lt;p&gt;CMA-ES modifies the above covariance calculation formula in a clever way to make it adapt well to an optimisation problem. I will go over how it does this step by step. Firstly, it focuses on the best &lt;script type=&quot;math/tex&quot;&gt;N_{best}&lt;/script&gt; solutions in the current generation. For simplicity, let’s set &lt;script type=&quot;math/tex&quot;&gt;N_{best}&lt;/script&gt; to be the best 25% of solutions. After sorting the solutions based on fitness, we calculate the mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g+1)}&lt;/script&gt; of the next generation &lt;script type=&quot;math/tex&quot;&gt;(g+1)&lt;/script&gt; as the average of only the best 25% of the solutions in the current population &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt;, i.e.:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_x^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}x_i,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_y^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}y_i.&lt;/script&gt;

&lt;p&gt;Next, we use only the best 25% of the solutions to estimate the covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C^{(g+1)}&lt;/script&gt; of the next generation. The clever &lt;em&gt;hack&lt;/em&gt; here is to use the &lt;em&gt;current&lt;/em&gt; generation’s mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g)}&lt;/script&gt; in the calculation, rather than the updated &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g+1)}&lt;/script&gt; parameters we just calculated:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_x^{2, (g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}(x_i - \mu_x^{(g)})^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_y^{2, (g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}(y_i - \mu_y^{(g)})^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_{xy}^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}(x_i - \mu_x^{(g)})(y_i - \mu_y^{(g)}).&lt;/script&gt;

&lt;p&gt;Armed with a set of &lt;script type=&quot;math/tex&quot;&gt;\mu_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\mu_y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_y&lt;/script&gt;, and &lt;script type=&quot;math/tex&quot;&gt;\sigma_{xy}&lt;/script&gt; parameters for the next generation &lt;script type=&quot;math/tex&quot;&gt;(g+1)&lt;/script&gt;, we can now sample the next generation of candidate solutions.&lt;/p&gt;

&lt;p&gt;Below is a set of figures to visually illustrate how it uses the results from the current generation &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt; to construct the solutions in the next generation &lt;script type=&quot;math/tex&quot;&gt;(g+1)&lt;/script&gt;:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step1.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 1&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step2.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 2&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step3.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 3&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step4.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 4&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;ol&gt;
  &lt;li&gt;Calculate the fitness score of each candidate solution in generation &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;Isolate the best 25% of the population in generation &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt;, in purple.&lt;/li&gt;
  &lt;li&gt;Using only the best solutions, along with the mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g)}&lt;/script&gt; of the current generation (the green dot), calculate the covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C^{(g+1)}&lt;/script&gt; of the next generation.&lt;/li&gt;
  &lt;li&gt;Sample a new set of candidate solutions using the updated mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g+1)}&lt;/script&gt; and covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C^{(g+1)}&lt;/script&gt;.&lt;/li&gt;
&lt;/ol&gt;
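&lt;p&gt;The four steps above can be sketched in a few lines of NumPy. This is only the simplified update described in this section, not the full CMA-ES, which also uses evolution paths, step-size control, and rank-one updates:&lt;/p&gt;

```python
import numpy as np

def cma_es_style_update(solutions, fitness, mu_old, best_frac=0.25):
    """Simplified CMA-ES-style update: new mean from the best 25%,
    covariance measured around the current mean mu_old."""
    n_best = int(len(solutions) * best_frac)
    elite = solutions[np.argsort(fitness)[-n_best:]]  # best 25% (step 2)
    mu_new = elite.mean(axis=0)                       # mu^(g+1)
    centered = elite - mu_old                         # old mean mu^(g) (step 3)
    C_new = centered.T @ centered / n_best            # C^(g+1)
    return mu_new, C_new

def sample_next_generation(mu_new, C_new, popsize):
    # step 4: sample candidates from the updated multivariate normal
    return np.random.multivariate_normal(mu_new, C_new, size=popsize)
```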

&lt;p&gt;Let’s visualise the scheme one more time, on the entire search process on both problems:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/cmaes2.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/cmaes2.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Because CMA-ES can adapt both its mean and covariance matrix using information from the best solutions, it can decide to cast a wider net when the best solutions are far away, or narrow the search space when the best solutions are close by.  My description of the CMA-ES algorithm for a 2D toy problem is highly simplified to get the idea across. For more details, I suggest reading the &lt;a href=&quot;https://arxiv.org/abs/1604.00772&quot;&gt;CMA-ES Tutorial&lt;/a&gt; prepared by Nikolaus Hansen, the author of CMA-ES.&lt;/p&gt;

&lt;p&gt;This algorithm is one of the most popular gradient-free optimisation algorithms out there, and has been the algorithm of choice for many researchers and practitioners alike. The only real drawback is its performance when the number of model parameters we need to solve for is large, as the covariance calculation is &lt;script type=&quot;math/tex&quot;&gt;O(N^2)&lt;/script&gt;, although recently there have been approximations that make it &lt;script type=&quot;math/tex&quot;&gt;O(N)&lt;/script&gt;. CMA-ES is my algorithm of choice when the search space is less than a thousand parameters. I find it still usable up to ~ 10K parameters if I’m willing to be patient.&lt;/p&gt;

&lt;h2 id=&quot;natural-evolution-strategies&quot;&gt;Natural Evolution Strategies&lt;/h2&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Imagine if you had built an artificial life simulator, and you sample a different neural network to control the behavior of each ant inside an ant colony. Using the Simple Evolution Strategy for this task will optimise for traits and behaviours that benefit individual ants, and with each successive generation, our population will be full of alpha ants who only care about their own well-being.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Instead of using a rule that is based on the survival of the fittest ants, what if you take an alternative approach where you take the sum of all fitness values of the entire ant population, and optimise for this sum instead to maximise the well-being of the entire ant population over successive generations? Well, you would end up creating a Marxist utopia.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;A perceived weakness of the algorithms mentioned so far is that they discard the majority of the solutions and only keep the best solutions. Weak solutions contain information about what &lt;em&gt;not&lt;/em&gt; to do, and this is valuable information to calculate a better estimate for the next generation.&lt;/p&gt;

&lt;p&gt;Many people who studied RL are familiar with the &lt;a href=&quot;http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf&quot;&gt;REINFORCE&lt;/a&gt; paper. In this 1992 paper, Williams outlined an approach to estimate the gradient of the expected rewards with respect to the model parameters of a policy neural network. This paper also proposed using REINFORCE as an Evolution Strategy, in Section 6 of the paper. This special case of &lt;em&gt;REINFORCE-ES&lt;/em&gt; was expanded later on in &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A64D1AE8313A364B814998E9E245B40A?doi=10.1.1.180.7104&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Parameter-Exploring Policy Gradients&lt;/a&gt; (PEPG, 2009) and &lt;a href=&quot;https://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf&quot;&gt;Natural Evolution Strategies&lt;/a&gt; (NES, 2014).&lt;/p&gt;

&lt;p&gt;In this approach, we want to use all of the information from each member of the population, good or bad, for estimating a gradient signal that can move the entire population to a better direction in the next generation. Since we are estimating a gradient, we can also use this gradient in a standard SGD update rule typically used for deep learning. We can even use this estimated gradient with Momentum SGD, RMSProp, or Adam if we want to.&lt;/p&gt;

&lt;p&gt;The idea is to maximise the &lt;em&gt;expected value&lt;/em&gt; of the fitness score of a sampled solution. If the expected result is good enough, then the best performing member within a sampled population will be even better, so optimising for the expectation might be a sensible approach. Maximising the expected fitness score of a sampled solution is almost the same as maximising the total fitness score of the entire population.&lt;/p&gt;

&lt;p&gt;If &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt; is a solution vector sampled from a probability distribution function &lt;script type=&quot;math/tex&quot;&gt;\pi(z, \theta)&lt;/script&gt;, we can define the expected value of the objective function &lt;script type=&quot;math/tex&quot;&gt;F&lt;/script&gt; as:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;J(\theta) = E_{\theta}[F(z)] = \int F(z) \; \pi(z, \theta) \; dz,&lt;/script&gt;

&lt;p&gt;where &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; are the parameters of the probability distribution function. For example, if &lt;script type=&quot;math/tex&quot;&gt;\pi&lt;/script&gt; is a normal distribution, then &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; would be &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;. For our simple 2D toy problems, each sampled solution &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt; is a 2D vector &lt;script type=&quot;math/tex&quot;&gt;(x, y)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf&quot;&gt;NES paper&lt;/a&gt; contains a nice derivation of the gradient of &lt;script type=&quot;math/tex&quot;&gt;J(\theta)&lt;/script&gt; with respect to &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt;. Using the same &lt;em&gt;log-likelihood trick&lt;/em&gt; as in the REINFORCE algorithm allows us to calculate the gradient of &lt;script type=&quot;math/tex&quot;&gt;J(\theta)&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\theta} J(\theta) = E_{\theta}[ \; F(z)  \; \nabla_{\theta} \log \pi(z, \theta) \; ].&lt;/script&gt;

&lt;p&gt;In a population size of &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt;, where we have solutions &lt;script type=&quot;math/tex&quot;&gt;z^1&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;z^2&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;...&lt;/script&gt; &lt;script type=&quot;math/tex&quot;&gt;z^N&lt;/script&gt;, we can estimate this gradient as a summation:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \; F(z^i)  \; \nabla_{\theta} \log \pi(z^i, \theta).&lt;/script&gt;

&lt;p&gt;With this gradient &lt;script type=&quot;math/tex&quot;&gt;\nabla_{\theta} J(\theta)&lt;/script&gt;, we can use a learning rate parameter &lt;script type=&quot;math/tex&quot;&gt;\alpha&lt;/script&gt; (such as 0.01) and start optimising the &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; parameters of pdf &lt;script type=&quot;math/tex&quot;&gt;\pi&lt;/script&gt; so that our sampled solutions will likely get higher fitness scores on the objective function &lt;script type=&quot;math/tex&quot;&gt;F&lt;/script&gt;. Using SGD (or Adam), we can update &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; for the next generation:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\theta \rightarrow \theta + \alpha \nabla_{\theta} J(\theta),&lt;/script&gt;

&lt;p&gt;and sample a new set of candidate solutions &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt; from this updated pdf, and continue until we arrive at a satisfactory solution.&lt;/p&gt;

&lt;p&gt;In Section 6 of the &lt;a href=&quot;https://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf&quot;&gt;REINFORCE&lt;/a&gt; paper, Williams derived closed-form formulas of the gradient &lt;script type=&quot;math/tex&quot;&gt;\nabla_{\theta} \log \pi(z^i, \theta)&lt;/script&gt;, for the special case where &lt;script type=&quot;math/tex&quot;&gt;\pi(z, \theta)&lt;/script&gt; is a factored multi-variate normal distribution (i.e., the correlation parameters are zero). In this special case, &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; are the &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; vectors. Therefore, each element of a solution can be sampled from a univariate normal distribution &lt;script type=&quot;math/tex&quot;&gt;z_j \sim N(\mu_j, \sigma_j)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The closed-form formulas for &lt;script type=&quot;math/tex&quot;&gt;\nabla_{\theta} \log N(z^i, \theta)&lt;/script&gt;, for each individual element of vector &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; on each solution &lt;script type=&quot;math/tex&quot;&gt;i&lt;/script&gt; in the population can be derived as:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\mu_{j}} \log N(z^i, \mu, \sigma) = \frac{z_j^i - \mu_j}{\sigma_j^2},&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\sigma_{j}} \log N(z^i, \mu, \sigma) = \frac{(z_j^i - \mu_j)^2 - \sigma_j^2}{\sigma_j^3}.&lt;/script&gt;

&lt;p&gt;For clarity, I use the index &lt;script type=&quot;math/tex&quot;&gt;j&lt;/script&gt; to count across the parameter space, and this is not to be confused with the superscript &lt;script type=&quot;math/tex&quot;&gt;i&lt;/script&gt;, used to count across each sampled member of the population. For our 2D problems, &lt;script type=&quot;math/tex&quot;&gt;z_1 = x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;z_2 = y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\mu_1 = \mu_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\mu_2 = \mu_y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_1 = \sigma_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_2 = \sigma_y&lt;/script&gt; in this context.&lt;/p&gt;
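&lt;p&gt;Plugging these closed-form gradients into the approximate gradient formula, one generation of the update can be sketched as follows. This is a bare-bones version without the baseline and antithetic sampling that PEPG adds, and the function name and defaults are my own:&lt;/p&gt;

```python
import numpy as np

def reinforce_es_step(mu, sigma, F, popsize=500, alpha=0.01):
    """One REINFORCE-ES generation for a factored gaussian distribution."""
    # sample solutions z^i ~ N(mu, sigma), elementwise
    z = mu + sigma * np.random.randn(popsize, len(mu))
    fitness = np.array([F(zi) for zi in z])
    # closed-form log-derivatives, per parameter j and per sampled solution i
    grad_mu = (z - mu) / sigma ** 2
    grad_sigma = ((z - mu) ** 2 - sigma ** 2) / sigma ** 3
    # monte-carlo estimate of grad J(theta), averaged over the population
    d_mu = np.mean(fitness[:, None] * grad_mu, axis=0)
    d_sigma = np.mean(fitness[:, None] * grad_sigma, axis=0)
    # plain SGD ascent on theta = (mu, sigma); keep sigma positive
    return mu + alpha * d_mu, np.maximum(sigma + alpha * d_sigma, 1e-3)
```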

&lt;p&gt;These two formulas can be plugged back into the approximate gradient formula to derive explicit update rules for &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;. In the papers mentioned above, they derived more explicit update rules, incorporated a &lt;em&gt;baseline&lt;/em&gt;, and introduced other tricks such as antithetic sampling in PEPG, which is what my implementation is based on. NES proposed incorporating the inverse of the Fisher Information Matrix into the gradient update rule. But the concept is basically the same as other ES algorithms, where we update the mean and standard deviation of a multi-variate normal distribution at each new generation, and sample a new set of solutions from the updated distribution. Below is a visualization of this algorithm in action, following the formulas described above:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/pepg.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/pepg.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;We see that this algorithm is able to dynamically change the &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;’s to explore or fine-tune the solution space as needed. Unlike CMA-ES, there is no correlation structure in our implementation, so we don’t get the diagonal ellipse samples, only the vertical or horizontal ones, although in principle we could derive update rules to incorporate the entire covariance matrix if we needed to, at the expense of computational efficiency.&lt;/p&gt;

&lt;p&gt;I like this algorithm because, like CMA-ES, the &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;’s can adapt so our search space can be expanded or narrowed over time. Because the correlation parameter is not used in this implementation, the efficiency of the algorithm is &lt;script type=&quot;math/tex&quot;&gt;O(N)&lt;/script&gt;, so I use PEPG when the performance of CMA-ES becomes an issue, typically when the number of model parameters exceeds several thousand.&lt;/p&gt;

&lt;h2 id=&quot;openai-evolution-strategy&quot;&gt;OpenAI Evolution Strategy&lt;/h2&gt;

&lt;p&gt;In OpenAI’s &lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;paper&lt;/a&gt;, they implement an evolution strategy that is a special case of the REINFORCE-ES algorithm outlined earlier. In particular, &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; is fixed to a constant number, and only the &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; parameter is updated at each generation. Below is what this strategy looks like, with a constant &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; parameter:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/openes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/oes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;In addition to this simplification, the paper also proposed a modification of the update rule that is suitable for parallel computation across different worker machines. In their update rule, a large grid of random numbers is pre-computed using a fixed seed. By doing this, each worker can reproduce the parameters of every other worker over time, and each worker needs only to communicate a single number, its final fitness result, to all of the other workers. This is important if we want to scale evolution strategies to thousands or even a million workers located on different machines: while it may not be feasible to transmit an entire solution vector a million times at each generation update, it may be feasible to transmit only the final fitness results. In the paper, they showed that by using 1440 workers on Amazon EC2, they were able to solve the MuJoCo Humanoid walking task in ~ 10 minutes.&lt;/p&gt;
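&lt;p&gt;A hedged sketch of these two ideas together, the fixed-sigma update and the shared-seed trick (the function and variable names here are my own, not OpenAI’s code): because every worker knows every other worker’s random seed, it can regenerate each perturbation locally, and only the scalar fitness values ever need to travel over the network.&lt;/p&gt;

```python
import numpy as np

def openai_es_update(mu, sigma, alpha, seeds, fitness_list):
    """Fixed-sigma ES update: mu += alpha / (N * sigma) * sum_i F_i * eps_i.
    Each eps_i is regenerated from worker i's known random seed, so workers
    only ever communicate their scalar fitness results."""
    grad = np.zeros_like(mu)
    for seed, fitness in zip(seeds, fitness_list):
        eps = np.random.RandomState(seed).randn(len(mu))  # worker i's noise
        grad += fitness * eps
    grad /= len(seeds) * sigma
    return mu + alpha * grad
```

&lt;p&gt;Since the update is deterministic given the seeds and fitness values, every worker applies the identical update and all copies of the parameters stay in sync without ever transmitting a parameter vector.&lt;/p&gt;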

&lt;p&gt;I think that, in principle, this parallel update rule should also work with the original algorithm that adapts &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;, but perhaps in practice they wanted to keep the number of moving parts to a minimum for their large-scale parallel computing experiments. This inspiring paper also discusses many other practical aspects of deploying ES for RL-style tasks, and I highly recommend going through it to learn more.&lt;/p&gt;

&lt;h2 id=&quot;fitness-shaping&quot;&gt;Fitness Shaping&lt;/h2&gt;

&lt;p&gt;Most of the algorithms above are usually combined with a &lt;em&gt;fitness shaping&lt;/em&gt; method, such as the rank-based fitness shaping method I will discuss here. Fitness shaping prevents outliers in the population from dominating the approximate gradient calculation mentioned earlier:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \; F(z^i)  \; \nabla_{\theta} \log \pi(z^i, \theta).&lt;/script&gt;

&lt;p&gt;If a particular &lt;script type=&quot;math/tex&quot;&gt;F(z^m)&lt;/script&gt; is much larger than the other &lt;script type=&quot;math/tex&quot;&gt;F(z^i)&lt;/script&gt; in the population, then the gradient might become dominated by this outlier, increasing the chance of the algorithm getting stuck in a local optimum. To mitigate this, one can apply a rank transformation of the fitness. Rather than use the actual fitness function, we rank the results and use an augmented fitness function that is proportional to the solution’s rank in the population. Below is a comparison of what the original set of fitness values may look like, and what the ranked fitness values look like:&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/assets/20171031/ranked_fitness.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;What this means is: suppose we have a population size of 101. We evaluate each solution with the actual fitness function, and then sort the solutions by their fitness. We assign an augmented fitness value of -0.50 to the worst performer, -0.49 to the second-worst solution, …, 0.49 to the second-best solution, and finally a fitness value of 0.50 to the best solution. This augmented set of fitness values is used to calculate the gradient update, instead of the actual fitness values. In a way, it is similar to applying Batch Normalization to the results, but more direct. There are alternative methods for fitness shaping, but they all give similar results in the end.&lt;/p&gt;
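&lt;p&gt;The rank transformation described above takes only a few lines. In this sketch, raw fitness values are replaced by evenly spaced values between -0.50 and 0.50 according to rank, so a huge outlier contributes 0.50 like any other best-of-population solution would, regardless of its magnitude:&lt;/p&gt;

```python
import numpy as np

def centered_ranks(fitness):
    # Map raw fitness values to evenly spaced values in [-0.5, 0.5]:
    # the worst solution gets -0.5, the best gets 0.5.
    n = len(fitness)
    ranks = np.empty(n)
    ranks[np.argsort(fitness)] = np.arange(n)  # 0 = worst, n-1 = best
    return ranks / (n - 1) - 0.5

raw = np.array([1.0, -3.0, 1000.0, 2.0, 0.5])  # one huge outlier
shaped = centered_ranks(raw)
# shaped is [0.0, -0.5, 0.5, 0.25, -0.25]: the outlier's magnitude is gone
```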

&lt;p&gt;I find fitness shaping to be very useful for RL tasks where the objective function is non-deterministic for a given policy network, which is often the case in RL environments where maps are randomly generated and opponents have random policies. It is less useful for optimising well-behaved functions that are deterministic, and using fitness shaping can sometimes slow down the time it takes to find a good solution.&lt;/p&gt;

&lt;h2 id=&quot;mnist&quot;&gt;MNIST&lt;/h2&gt;

&lt;p&gt;Although ES might be a way to search for more novel solutions that are difficult for gradient-based methods to find, it still vastly underperforms gradient-based methods on many problems where we can calculate high quality gradients. For instance, only an idiot would attempt to use a genetic algorithm for image classification. But sometimes &lt;a href=&quot;https://blog.openai.com/nonlinear-computation-in-linear-networks/&quot;&gt;such people&lt;/a&gt; do exist in the world, and sometimes these explorations can be fruitful!&lt;/p&gt;

&lt;p&gt;Since all ML algorithms should be tested on MNIST, I also tried to apply these various ES algorithms to find weights for a small, simple 2-layer convnet used to classify MNIST, just to see where we stand compared to SGD. The convnet only has ~ 11k parameters so we can accommodate the slower CMA-ES algorithm. The code and the experiments are available &lt;a href=&quot;https://github.com/hardmaru/pytorch_notebooks/tree/master/mnist_es&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
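&lt;p&gt;The main piece of glue needed for this experiment is mapping each flat solution vector from the ES solver back into the network’s weight tensors before evaluation. The shapes below correspond to the 2-layer convnet described above (two 5x5 conv layers with 8 and 16 filters, and a final 10-way linear layer), but the helper itself is an illustrative sketch, not the code from the repository:&lt;/p&gt;

```python
import numpy as np

# Weight shapes for the small convnet: conv1 and conv2 (5x5 kernels,
# with biases), then the linear layer over 16 pooled 7x7 feature maps.
SHAPES = [(8, 1, 5, 5), (8,), (16, 8, 5, 5), (16,), (10, 16 * 7 * 7), (10,)]
NPARAMS = sum(int(np.prod(s)) for s in SHAPES)  # 11274, i.e. ~11k parameters

def unflatten(flat, shapes=SHAPES):
    # Slice one flat ES solution vector into the individual weight tensors.
    weights, idx = [], 0
    for s in shapes:
        n = int(np.prod(s))
        weights.append(flat[idx:idx + n].reshape(s))
        idx += n
    assert idx == flat.size  # the solver's vector must match the model
    return weights
```

Each candidate solution returned by the solver’s ask step would be passed through a helper like this and loaded into the model before measuring its classification accuracy.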

&lt;!--
______
&lt;code&gt;class Net(nn.Module):&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;def __init__(self):&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;super(Net, self).__init__()&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter1 = 8&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter2 = 16&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_padding = 2&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.filter_size = 5&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# input is 28x28&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.conv1 = nn.Conv2d(1,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter1,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.filter_size,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;padding=self.num_padding)&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# feature map size is 14*14 by pooling&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.conv2 = nn.Conv2d(self.num_filter1,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter2,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.filter_size,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;padding=self.num_padding)&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# feature map size is 7*7 by pooling&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.fc = nn.Linear(self.num_filter2*7*7, 10)&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;def forward(self, x):&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = F.max_pool2d(F.relu(self.conv1(x)), 2)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = F.max_pool2d(F.relu(self.conv2(x)), 2)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = x.view(-1, self.num_filter2*7*7)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = self.fc(x)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return F.log_softmax(x)&lt;/code&gt;

______
--&gt;
&lt;p&gt;Below are the results for various ES methods, using a population size of 101, over 300 epochs. We keep track of the model parameters that performed best on the entire training set at the end of each epoch, and evaluate this model once on the test set after 300 epochs. It is interesting that the test accuracy is sometimes higher than the training accuracy for the models with lower scores.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Method&lt;/th&gt;
      &lt;th&gt;Train Set&lt;/th&gt;
      &lt;th&gt;Test Set&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Adam (BackProp) Baseline&lt;/td&gt;
      &lt;td&gt;99.8&lt;/td&gt;
      &lt;td&gt;98.9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Simple GA&lt;/td&gt;
      &lt;td&gt;82.1&lt;/td&gt;
      &lt;td&gt;82.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CMA-ES&lt;/td&gt;
      &lt;td&gt;98.4&lt;/td&gt;
      &lt;td&gt;98.1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;OpenAI-ES&lt;/td&gt;
      &lt;td&gt;96.0&lt;/td&gt;
      &lt;td&gt;96.2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;PEPG&lt;/td&gt;
      &lt;td&gt;98.5&lt;/td&gt;
      &lt;td&gt;98.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171031/mnist_results.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;We should take these results with a grain of salt, since they are based on a single run, rather than the average of 5-10 runs. The single-run results seem to indicate that CMA-ES is the best at the MNIST task, with the PEPG algorithm not far off. Both of these algorithms achieved ~ 98% test accuracy, 1% lower than the SGD/Adam baseline. Perhaps the ability to dynamically alter its covariance matrix and standard deviation parameters at each generation allowed CMA-ES to fine-tune its weights better than OpenAI’s simpler variation.&lt;/p&gt;

&lt;h2 id=&quot;try-it-yourself&quot;&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;There are probably open source implementations of all of the algorithms described in this article. The author of CMA-ES, Nikolaus Hansen, has been maintaining a numpy-based implementation of &lt;a href=&quot;https://github.com/CMA-ES/pycma&quot;&gt;CMA-ES&lt;/a&gt; with lots of bells and whistles. His python implementation introduced me to the training loop interface described earlier. Since this interface is quite easy to use, I also implemented the other algorithms, such as the Simple Genetic Algorithm, PEPG, and OpenAI’s ES, using the same interface, put them in a small python file called &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt;, and wrapped the original CMA-ES library in this small library as well. This way, I can quickly compare different ES algorithms by changing just one line:&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;code&gt;import es&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;#solver = es.SimpleGA(...)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;#solver = es.PEPG(...)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;#solver = es.OpenES(...)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;solver = es.CMAES(...)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;while True:&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;solutions = solver.ask()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;fitness_list = np.zeros(solver.popsize)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;for i in range(solver.popsize):&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;fitness_list[i] = evaluate(solutions[i])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;solver.tell(fitness_list)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;result = solver.result()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;if result[1] &amp;gt; MY_REQUIRED_FITNESS:&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;break&lt;/code&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;You can look at &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt; on &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/es.py&quot;&gt;GitHub&lt;/a&gt; and the IPython notebook &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/simple_es_example.ipynb&quot;&gt;examples&lt;/a&gt; using the various ES algorithms.&lt;/p&gt;

&lt;p&gt;In this &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/simple_es_example.ipynb&quot;&gt;IPython notebook&lt;/a&gt; that accompanies &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt;, I show how to use the ES solvers in &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt; to solve a 100-dimensional version of the Rastrigin function with even more local optima. The 100-D version is somewhat more challenging than the trivial 2D version used to produce the visualizations in this article. Below is a comparison of the performance of the various algorithms discussed:&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171031/rastrigin10d.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;On this 100-D Rastrigin problem, none of the optimisers reached the global optimum, although CMA-ES came close and blew everything else away. PEPG came in 2nd place, while OpenAI-ES and the simple genetic algorithm fell behind. I had to use an annealing schedule to gradually lower &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; for OpenAI-ES to make it perform better on this task.&lt;/p&gt;
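&lt;p&gt;For reference, a shifted Rastrigin function consistent with the figure caption below, which places the global optimum at 10 in every dimension, can be written as follows. This is an illustrative sketch; the sign is flipped so that the ES solvers, which maximise fitness, see an optimum of zero:&lt;/p&gt;

```python
import numpy as np

def rastrigin_fitness(x, shift=10.0):
    # Shifted N-dimensional Rastrigin, negated so that ES maximises it.
    # The global optimum sits at x_i = shift for every i, with fitness 0;
    # every other point has strictly negative fitness.
    z = np.asarray(x) - shift
    return -(10.0 * z.size + np.sum(z ** 2 - 10.0 * np.cos(2.0 * np.pi * z)))
```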
&lt;p&gt;&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/assets/20171031/rastrigin_cma_solution.png&quot; width=&quot;60%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Final solution that CMA-ES discovered for the 100-D Rastrigin function.&lt;br /&gt;The global optimum is a 100-dimensional vector where every element is exactly 10.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;

&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;so proud of my little dude ... &lt;a href=&quot;https://t.co/j5j61vQxP0&quot;&gt;pic.twitter.com/j5j61vQxP0&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/889265345172615168?ref_src=twsrc%5Etfw&quot;&gt;July 23, 2017&lt;/a&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt; 
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;/2017/11/12/evolving-stable-strategies/&quot;&gt;next article&lt;/a&gt;, I will look at applying ES to other experiments and more interesting problems. Please &lt;a href=&quot;/2017/11/12/evolving-stable-strategies/&quot;&gt;check&lt;/a&gt; it out!&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{ha2017visual,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;A Visual Guide to Evolution Strategies&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author&amp;nbsp;&amp;nbsp;= &quot;Ha, David&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal&amp;nbsp;= &quot;blog.otoro.net&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;2017&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;url&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;https://blog.otoro.net/2017/10/29/visual-evolution-strategies/&quot;&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;

&lt;h2 id=&quot;references-and-other-links&quot;&gt;References and Other Links&lt;/h2&gt;

&lt;p&gt;Below are a few links to information related to evolutionary computing which I found useful or inspiring.&lt;/p&gt;

&lt;p&gt;Image Credits of &lt;a href=&quot;https://www.reddit.com/r/CryptoMarkets/comments/6qpla3/investing_in_icos_results_may_vary/&quot;&gt;Lemmings Jumping off a Cliff&lt;/a&gt;. Your results may vary when investing in ICOs.&lt;/p&gt;

&lt;p&gt;CMA-ES: &lt;a href=&quot;https://github.com/CMA-ES&quot;&gt;Official Reference Implementation&lt;/a&gt; on GitHub, &lt;a href=&quot;https://arxiv.org/abs/1604.00772&quot;&gt;Tutorial&lt;/a&gt;, Original CMA-ES &lt;a href=&quot;http://www.cmap.polytechnique.fr/~nikolaus.hansen/cmaartic.pdf&quot;&gt;Paper&lt;/a&gt; from 2001, Overview &lt;a href=&quot;https://www.slideshare.net/OsamaSalaheldin2/cmaes-presentation&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf&quot;&gt;Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning&lt;/a&gt; (REINFORCE), 1992.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A64D1AE8313A364B814998E9E245B40A?doi=10.1.1.180.7104&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Parameter-Exploring Policy Gradients&lt;/a&gt;, 2009.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf&quot;&gt;Natural Evolution Strategies&lt;/a&gt;, 2014.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;Evolution Strategies as a Scalable Alternative to Reinforcement Learning&lt;/a&gt;, OpenAI, 2017.&lt;/p&gt;

&lt;p&gt;Risto Miikkulainen’s &lt;a href=&quot;http://nn.cs.utexas.edu/downloads/slides/miikkulainen.ijcnn13.pdf&quot;&gt;Slides&lt;/a&gt; on Neuroevolution.&lt;/p&gt;

&lt;p&gt;A Neuroevolution Approach to &lt;a href=&quot;http://www.cs.utexas.edu/~ai-lab/?atari&quot;&gt;General Atari Game Playing&lt;/a&gt;, 2013.&lt;/p&gt;

&lt;p&gt;Kenneth Stanley’s Talk on &lt;a href=&quot;https://youtu.be/dXQPL9GooyI&quot;&gt;Why Greatness Cannot Be Planned: The Myth of the Objective&lt;/a&gt;, 2015.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/ideas/neuroevolution-a-different-kind-of-deep-learning&quot;&gt;Neuroevolution&lt;/a&gt;: A Different Kind of Deep Learning. The quest to evolve neural networks through evolutionary algorithms.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://people.idsia.ch/~juergen/compressednetworksearch.html&quot;&gt;Compressed Network Search&lt;/a&gt; Finds Complex Neural Controllers with a Million Weights.&lt;/p&gt;

&lt;p&gt;Karl Sims &lt;a href=&quot;https://youtu.be/JBgG_VSP7f8&quot;&gt;Evolved Virtual Creatures&lt;/a&gt;, 1994.&lt;/p&gt;

&lt;p&gt;Evolved &lt;a href=&quot;https://youtu.be/euFvRfQRbLI&quot;&gt;Step Climbing&lt;/a&gt; Creatures.&lt;/p&gt;

&lt;p&gt;Super Mario World Agent &lt;a href=&quot;https://youtu.be/qv6UVOQ0F44&quot;&gt;Mario I/O&lt;/a&gt;, Mario Kart 64 &lt;a href=&quot;https://github.com/nicknlsn/MarioKart64NEAT&quot;&gt;Controller&lt;/a&gt; using the &lt;a href=&quot;https://www.cs.ucf.edu/~kstanley/neat.html&quot;&gt;NEAT Algorithm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.bionik.tu-berlin.de/institut/xstart.htm&quot;&gt;Ingo Rechenberg&lt;/a&gt;, the inventor of Evolution Strategies.&lt;/p&gt;

&lt;p&gt;A Tutorial on &lt;a href=&quot;https://pablormier.github.io/2017/09/05/a-tutorial-on-differential-evolution-with-python/&quot;&gt;Differential Evolution&lt;/a&gt; with Python.&lt;/p&gt;

&lt;h3 id=&quot;my-previous-evolutionary-projects&quot;&gt;My Previous &lt;em&gt;Evolutionary&lt;/em&gt; Projects&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://deepmind.com/research/publications/pathnet-evolution-channels-gradient-descent-super-neural-networks/&quot;&gt;PathNet&lt;/a&gt;: Evolution Channels Gradient Descent in Super Neural Networks&lt;/p&gt;

&lt;p&gt;Neural Network Evolution Playground with &lt;a href=&quot;https://blog.otoro.net/2016/05/07/backprop-neat/&quot;&gt;Backprop NEAT&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evolved Neural &lt;a href=&quot;https://otoro.net/gallery&quot;&gt;Art Gallery&lt;/a&gt; using &lt;a href=&quot;https://otoro.net/neurogram/&quot;&gt;CPPN&lt;/a&gt; Implementation&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://otoro.net/planks/&quot;&gt;Creatures Avoiding Planks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://otoro.net/nabi/slimevolley/index.html&quot;&gt;Neural Slime Volleyball&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evolution of &lt;a href=&quot;https://otoro.net/ml/pendulum-esp/index.html&quot;&gt;Inverted Double Pendulum&lt;/a&gt; Swing Up Controller&lt;/p&gt;
</description>
        <pubDate>Sun, 29 Oct 2017 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Teaching Machines to Draw</title>
        <link>https://blog.otoro.net/2017/05/19/teaching-machines-to-draw/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/05/19/teaching-machines-to-draw/</guid>
        <description>&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/sketch_rnn_examples.svg&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;

Latent space interpolation of various vector drawings produced by &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;.&lt;br /&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/tensorflow/magenta/blob/master/magenta/models/sketch_rnn/README.md&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;

&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;This is an updated version of my article, cross-posted on the Google Research &lt;a href=&quot;https://research.googleblog.com/2017/04/teaching-machines-to-draw.html&quot;&gt;Blog&lt;/a&gt;.  Instructions on using the &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; model are available at the Google Brain &lt;a href=&quot;https://magenta.tensorflow.org/sketch_rnn&quot;&gt;Magenta Project&lt;/a&gt;.  Link to our paper, “&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;”.  This article has also been translated to &lt;a href=&quot;https://www.jqr.com/news/009523&quot;&gt;Simplified Chinese&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/frog_crab_cat.png&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Vector drawings produced by our model.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;Recently, there have been major advancements in generative modelling of images using neural networks as a generative tool. While there is already a &lt;a href=&quot;https://github.com/carpedm20/BEGAN-tensorflow/blob/master/README.md&quot;&gt;large&lt;/a&gt; &lt;a href=&quot;https://affinelayer.com/pixsrv/&quot;&gt;body&lt;/a&gt; &lt;a href=&quot;https://github.com/carpedm20/DCGAN-tensorflow&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://github.com/skaae/vaeblog&quot;&gt;existing&lt;/a&gt; &lt;a href=&quot;https://github.com/junyanz/CycleGAN/blob/master/README.md&quot;&gt;work&lt;/a&gt; on generative modelling of images using neural networks, most of the work thus far has targeted low-resolution, pixel images.&lt;/p&gt;

&lt;p&gt;Humans, however, do not understand the world as a grid of pixels, but rather develop abstract concepts to represent what we see. From a young age, we develop the ability to communicate what we see by drawing on a piece of paper with a pencil. In this way we learn to express a sequential, &lt;em&gt;vector&lt;/em&gt; representation of an image as a short sequence of strokes.  In this work, we investigate an alternative to traditional pixel image modelling approaches, and propose a generative model for vector images.&lt;/p&gt;

&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Humans learn to draw sequentially. Designers rely on vector graphics. Yet most ML Research focus only on generative models for pixel images. &lt;a href=&quot;https://t.co/3VHe3HmFCi&quot;&gt;pic.twitter.com/3VHe3HmFCi&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/866055378005401600&quot;&gt;May 20, 2017&lt;/a&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;i&gt;Children learn to draw &lt;a href=&quot;https://en.wikipedia.org/wiki/Doraemon_(character)&quot;&gt;Doraemon&lt;/a&gt; as a sequential set of strokes.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;Children develop the ability to depict objects, and arguably even emotions, with only a few pen strokes. They learn to draw their favourite anime characters, family, friends and familiar places. These simple drawings may not resemble reality as captured by a photograph, but they do tell us something about how people represent and reconstruct images of the world around them.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;i&gt;“The function of vision is to update the internal model of the world inside our head, but what we put on a piece of paper is the internal model.”&lt;/i&gt;&lt;/b&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
 &amp;mdash; Harold Cohen, &lt;a href=&quot;https://youtu.be/Xlhd8iP1hXo?t=20m&quot;&gt;Reflections on Design and Building AARON&lt;/a&gt;.
&lt;p&gt;
&lt;/p&gt;
&lt;/center&gt;
&lt;hr /&gt;

&lt;p&gt;In our paper, “&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;”, we present a generative recurrent neural network capable of producing sketches of common objects, with the goal of training a machine to draw and generalize abstract concepts in a manner similar to humans. We train our model on a &lt;a href=&quot;https://quickdraw.withgoogle.com/data&quot;&gt;dataset&lt;/a&gt; of hand-drawn sketches, each represented as a sequence of motor actions controlling a pen: which direction to move, when to lift the pen up, and when to stop drawing. In doing so, we created a model that potentially has many applications, from assisting the creative process of an artist, to helping teach students how to draw.&lt;/p&gt;

&lt;p&gt;In this work, we model a vector-based representation of images inspired by how people draw. We use recurrent neural networks as our generative model. Not only can our recurrent neural network generate individual vector drawings by constructing a sequence of strokes, like these previous experiments on Generative &lt;a href=&quot;https://blog.otoro.net/2015/12/12/handwriting-generation-demo-in-tensorflow/&quot;&gt;Handwriting&lt;/a&gt; and Generative &lt;a href=&quot;https://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow/&quot;&gt;Kanji&lt;/a&gt;, our model can also generate a vector drawing conditional on a &lt;em&gt;latent vector&lt;/em&gt;, &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt;, as an input into the model.&lt;/p&gt;

&lt;p&gt;Similar to a previous &lt;a href=&quot;https://blog.otoro.net/2016/04/01/generating-large-images-from-latent-vectors/&quot;&gt;work&lt;/a&gt; where we interpolate between multiple latent vectors to generate animated high-resolution morphing MNIST animations, we can train our model on hand-drawn sketches from the &lt;em&gt;yoga&lt;/em&gt; category of the &lt;a href=&quot;https://quickdraw.withgoogle.com/data/yoga&quot;&gt;QuickDraw&lt;/a&gt; dataset, and have it dream up yoga positions in both time and space directions.&lt;/p&gt;

&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;An RNN&amp;#39;s Understanding of Yoga. &lt;a href=&quot;https://t.co/0E4AJ3B49X&quot;&gt;pic.twitter.com/0E4AJ3B49X&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/852943471866281985&quot;&gt;April 14, 2017&lt;/a&gt;&lt;/blockquote&gt;&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;i&gt;“Generating sequential data is the closest computers get to dreaming.”&lt;br /&gt;&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;h2 id=&quot;a-generative-model-for-vector-drawings&quot;&gt;A Generative Model for Vector Drawings&lt;/h2&gt;

&lt;p&gt;Our model, &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;, is based on the &lt;a href=&quot;https://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/&quot;&gt;sequence-to-sequence&lt;/a&gt; (seq2seq) autoencoder framework. It incorporates &lt;a href=&quot;https://jmetzen.github.io/2015-11-27/vae.html&quot;&gt;variational inference&lt;/a&gt; and utilizes &lt;a href=&quot;https://blog.otoro.net/2016/09/28/hyper-networks/&quot;&gt;Hyper Networks&lt;/a&gt; as recurrent neural network cells. The goal of a seq2seq autoencoder is to train a network to encode an input sequence into a vector of floating point numbers, called a latent vector, and from this latent vector reconstruct an output sequence using a decoder that replicates the input sequence as closely as possible.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/sketch_rnn_schematic.svg&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Schematic of &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In our model, we deliberately add noise to the latent vector. In our paper, we show that by inducing noise into the communication channel between the encoder and the decoder, the model is no longer able to reproduce the input sketch exactly, but instead must learn to capture the essence of the sketch as a noisy latent vector. Our decoder takes this latent vector and produces a sequence of motor actions used to construct a new sketch. In the figure below, we feed several actual sketches of cats into the encoder to produce reconstructed sketches using the decoder.&lt;/p&gt;
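&lt;p&gt;The noise injection described above follows the usual variational-autoencoder recipe: the encoder outputs a mean and a (log) standard deviation for each latent dimension, and the latent vector is sampled as &lt;script type=&quot;math/tex&quot;&gt;z = \mu + \sigma \epsilon&lt;/script&gt;. Below is a minimal numpy sketch, with hypothetical encoder outputs standing in for the real model:&lt;/p&gt;

```python
import numpy as np

def sample_latent(mu, log_sigma, rng):
    # Reparameterisation trick: draw z = mu + sigma * eps, keeping the
    # randomness outside the learned encoder/decoder mapping.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for one sketch, using a 128-D latent space.
mu, log_sigma = np.zeros(128), np.full(128, -1.0)
z = sample_latent(mu, log_sigma, rng)  # noisy latent fed to the decoder
```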

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_cats.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Reconstructions from a model trained on cat sketches sampled at varying temperature levels.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;It is important to emphasize that the reconstructed cat sketches are not copies of the input sketches, but are instead new sketches of cats with similar characteristics as the inputs. To demonstrate that the model is not simply copying from the input sequence, and that it actually learned something about the way people draw cats, we can try to feed in non-standard sketches into the encoder.  When we feed in a sketch of a three-eyed cat, the model generates a similar looking cat that has two eyes instead, suggesting that our model has learned that cats usually only have two eyes.&lt;/p&gt;

&lt;p&gt;To show that our model is not simply choosing the closest normal-looking cat from a large collection of memorized cat-sketches, we can try to input something totally different, like a sketch of a toothbrush. We see that the network generates a cat-like figure with long whiskers that mimics the features and orientation of the toothbrush. This suggests that the network has learned to encode an input sketch into a set of abstract cat-concepts embedded into the latent vector, and is also able to reconstruct an entirely new sketch based on this latent vector.&lt;/p&gt;

&lt;p&gt;Not convinced? We repeat the experiment again on a model trained on pig sketches and arrive at similar conclusions. When presented with an eight-legged pig, the model generates a similar pig with only four legs. If we feed a truck into the pig-drawing model, we get a pig that looks a bit like the truck.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_pigs.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Reconstructions from a model trained on pig sketches sampled at varying temperature levels.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;To investigate how these latent vectors encode conceptual animal features, in the figure below, we first obtain two latent vectors encoded from two very different pigs, in this case a pig head (in the green box) and a full pig (in the orange box). We want to get a sense of how our model learned to represent pigs, and one way to do this is to interpolate between the two latent vectors and visualize the sketch generated from each interpolated latent vector. In the figure below, we visualize how the sketch of the pig head slowly morphs into the sketch of the full pig, and in the process show how the model organizes the concepts of pig sketches. We see that the latent vector controls the relative position and size of the nose with respect to the head, as well as the existence of the body and legs in the sketch.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/pig_morph.png&quot; width=&quot;65%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Latent space interpolations generated from a model trained on pig sketches.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We would also like to know if our model can learn representations of multiple animals, and if so, what would they look like? In the figure below, we generate sketches from interpolating latent vectors between a cat head and a full pig. We see how the representation slowly transitions from a cat head, to a cat with a tail, to a cat with a fat body, and finally into a full pig. Like a child learning to draw animals, our model learns to construct animals by attaching a head, feet, and a tail to its body. We see that the model is also able to draw cat heads that are distinct from pig heads.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_morphs.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Latent Space Interpolations from a model trained on sketches of both cats and pigs.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;These interpolation examples suggest that the latent vectors indeed encode conceptual features of a sketch. But can we use these features to augment other sketches without such features - for example, adding a body to a cat’s head?&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_analogy.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Learned relationships between abstract concepts, explored using latent vector arithmetic.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;Indeed, we find that sketch drawing analogies are possible for our model trained on both cat and pig sketches. For example, we can subtract the latent vector of an encoded pig head from the latent vector of a full pig, to arrive at a vector that represents the concept of a body. Adding this difference to the latent vector of a cat head results in a full cat (i.e. cat head + body = full cat). These drawing analogies allow us to explore how the model organizes its latent space to represent different concepts in the manifold of generated sketches.&lt;/p&gt;
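&lt;p&gt;A back-of-the-envelope sketch of this latent vector arithmetic, using made-up toy vectors rather than actual model encodings:&lt;/p&gt;

```javascript
// Element-wise subtract and add over plain float arrays.
function vecSub(a, b) { return a.map(function(v, i) { return v - b[i]; }); }
function vecAdd(a, b) { return a.map(function(v, i) { return v + b[i]; }); }

// Toy latent vectors; a real model would obtain these from its encoder.
var zFullPig = [1.0, 0.75, 0.25];
var zPigHead = [1.0, 0.25, 0.25];
var zCatHead = [-0.5, 0.25, 0.5];

// Subtracting a pig head from a full pig isolates a "body" concept,
var zBody = vecSub(zFullPig, zPigHead);

// and adding that concept to a cat head should decode to a full cat.
var zFullCat = vecAdd(zCatHead, zBody);
```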

&lt;h2 id=&quot;creative-applications&quot;&gt;Creative Applications&lt;/h2&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/multiple_interpolations.png&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Exploring the latent space of generated sketches of everyday objects.&lt;br /&gt;
Latent space interpolation from left to right, and then top to bottom.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In addition to the research component of this work, we are also super excited about potential creative applications of &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;. For instance, even in the simplest use case, pattern designers can apply &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; to generate a large number of similar, but unique designs for textile or wallpaper prints.&lt;/p&gt;

&lt;p&gt;As we saw earlier, a model trained to draw pigs can be made to draw pig-like trucks if given an input sketch of a truck. We can extend this result to applications that might help creative designers come up with abstract designs that can resonate more with their target audience.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/cat_vae.png&quot; width=&quot;47%&quot; /&gt;&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/morph_catchairs.svg&quot; width=&quot;53%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Similar, but unique cats, generated from a single input sketch in the green box (left).&lt;br /&gt;
Exploring the latent space of generated chair-cats (right).
&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;For instance, in the figure above, we feed sketches of four different chairs into our cat-drawing model to produce four chair-like cats. We can go further and incorporate the interpolation methodology described earlier to explore the latent space of chair-like cats, and produce a large grid of generated designs to select from.&lt;/p&gt;

&lt;p&gt;Exploring the latent space between different objects can potentially enable creative designers to find interesting intersections and relationships between different drawings:&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/catbus.svg&quot; width=&quot;80%&quot; /&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/catbus2.svg&quot; width=&quot;80%&quot; /&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/elephantpig.svg&quot; width=&quot;80%&quot; /&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/owlmorph.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;Exploring the latent space between cats and buses, elephants and pigs, and various owls.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We can also use the decoder module of &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; as a standalone model and train it to predict different possible endings of incomplete sketches. This technique can lead to applications where the model assists the creative process of an artist by suggesting alternative ways to finish an incomplete sketch. In the figure below, we draw different incomplete sketches (in red), and have the model come up with different possible ways to complete the drawings.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/full_predictions.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;The model can start with incomplete sketches and automatically generate different completions.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We believe the best creative works will not be created by machines alone, but by designers who use machine learning as a tool to enrich their creative thinking process. In the future, we envision these tools being used collaboratively by artists and designers. Below is a simple conceptual example illustrating this collaboration using our model:&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://otoro.net/img/make_it_rain.gif&quot; /&gt;&lt;br /&gt;
&lt;i&gt;“Making it rain with recurrent neural nets.”&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We are very excited about the future possibilities of generative vector image modelling. These models will enable many exciting new creative applications in a variety of different directions. They can also serve as a tool to help us improve our understanding of our own creative thought processes. Learn more about &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; by reading our paper, “&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;”.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{ha2017neural,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title={A neural representation of sketch drawings},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author={Ha, David and Eck, Douglas},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal={arXiv preprint arXiv:1704.03477},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year={2017}&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;

</description>
        <pubDate>Fri, 19 May 2017 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Recurrent Neural Network Tutorial for Artists</title>
        <link>https://blog.otoro.net/2017/01/01/recurrent-neural-network-artist/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/01/01/recurrent-neural-network-artist/</guid>
        <description>&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/rnn-tutorial/master/neural.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;This post is not meant to be a comprehensive overview of recurrent neural networks.  It is intended for readers without any machine learning background.  The goal is to show artists and designers how to use a pre-trained neural network to produce interactive digital works using simple Javascript and the &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; library.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;div id=&quot;sketch01&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;i&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://otoro.net/ml/rnn-tutorial&quot;&gt;Handwriting Generation with Javascript&lt;/a&gt;&lt;/i&gt;&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;Machine learning has become a popular tool for the creative community in recent years. Techniques such as &lt;a href=&quot;https://github.com/lengstrom/fast-style-transfer&quot;&gt;style transfer&lt;/a&gt;, &lt;a href=&quot;https://aiexperiments.withgoogle.com/drum-machine&quot;&gt;t-sne&lt;/a&gt;, &lt;a href=&quot;https://gabgoh.github.io/ThoughtVectors/&quot;&gt;autoencoders&lt;/a&gt;, &lt;a href=&quot;https://opendot.github.io/ml4a-invisible-cities/&quot;&gt;generative adversarial networks&lt;/a&gt;, and &lt;a href=&quot;http://www.evolvingai.org/InnovationEngine&quot;&gt;countless&lt;/a&gt; other methods have made their way into the digital artist’s toolbox. Many &lt;a href=&quot;http://golancourses.net/2016/lectures/3-15/3-15-machine-learning/&quot;&gt;techniques&lt;/a&gt; take advantage of &lt;a href=&quot;https://medium.com/@kcimc/a-return-to-machine-learning-2de3728558eb&quot;&gt;convolutional neural networks&lt;/a&gt; for feature extraction and feature processing.&lt;/p&gt;

&lt;p&gt;On the other end of the spectrum, recurrent neural networks and other &lt;a href=&quot;https://www.asimovinstitute.org/analyzing-deep-learning-tools-music/&quot;&gt;autoregressive models&lt;/a&gt; enable powerful tools that can generate realistic sequential data.  Artists have employed such techniques to &lt;a href=&quot;https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0#.jikh6uesi&quot;&gt;generate&lt;/a&gt; &lt;a href=&quot;https://www.robinsloan.com/notes/writing-with-the-machine/&quot;&gt;text&lt;/a&gt;, &lt;a href=&quot;https://www.technologyreview.com/s/603003/ai-songsmith-cranks-out-surprisingly-catchy-tunes/&quot;&gt;music&lt;/a&gt;, and &lt;a href=&quot;https://github.com/ibab/tensorflow-wavenet&quot;&gt;sounds&lt;/a&gt;.  One area that I feel lacks focus at the moment is the generation of vector artwork, perhaps due to the lack of available data.&lt;/p&gt;

&lt;p&gt;Handwriting is a form of sketch artwork.  Recently, I have collaborated with &lt;a href=&quot;https://twitter.com/shancarter&quot;&gt;Shan Carter&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/enjalot&quot;&gt;Ian Johnson&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/ch402&quot;&gt;Chris Olah&lt;/a&gt; to publish a &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;post&lt;/a&gt; on &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; on handwriting generation.  In particular, the experiments in the post help visualise the internals of a recurrent neural network trained to generate handwriting.  The truth is, that project also served as a kind of meta-experiment for myself.  Rather than directly working on the visualisation experiments and writeup, I set out to create a pre-trained handwriting model with an easy-to-use Javascript interface, and have my collaborators, who are highly talented data visualisation artists, experiment with the model to create something out of it.  They ended up creating the beautiful interactive visualization experiments in the &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; post.&lt;/p&gt;

&lt;!--The experiment made me appreciate the power of abstraction, in the context of complex systems.  Architects design beautiful buildings without the need to understand low-level physical properties of glass, steel, or concrete.  The materials are engineered with high-level specification of safety limits, and beautiful structures can be built from these materials as long as they operate within a safety framework.  Although there are architects who are also capable engineers who understand both professions, I feel the need to constantly think of low level details will limit the creative process during design.  By hiding complexity, these abstraction layers help designers to focus on creative output.--&gt;

&lt;p&gt;I decided to write this post and make available the same handwriting model used in the &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; project along with explanations, with the hope that other artists and designers can also take advantage of these technologies and even go deeper into the field.&lt;/p&gt;

&lt;h2 id=&quot;modelling-a-handwriting-brain&quot;&gt;Modelling a Handwriting Brain&lt;/h2&gt;

&lt;p&gt;There are many things going on in our brain when we are writing a letter.  Based on what we set out to accomplish by writing, we make a plan about what we are going to write, select a suitable choice of vocabulary, decide how neat our handwriting needs to be, and then pick up the pen and start writing something on a pad of paper, making decisions about where to place the pen, where to move it, and when to pick it up.&lt;/p&gt;

&lt;p&gt;It would be difficult to create a Javascript model to simulate the entire human brain for writing a letter, but we can instead try to &lt;em&gt;model&lt;/em&gt; the handwriting brain approximately by focusing on the last part of the handwriting process, namely where to place the pen, where to move it, and when to pick it up.  So our model of the handwriting process will only care about the location of the pen, and whether the pen is touching the paper pad.&lt;/p&gt;

&lt;p&gt;We also make two assumptions about the model.  The first assumption is that the decision of what the model will write next will only depend on whatever it wrote in the past.  When we write things, we remember precisely the details of the last pen stroke, but we don’t actually remember exactly what we wrote many strokes ago, and only have a vague idea about what was written.  This &lt;em&gt;vague idea&lt;/em&gt; about what was written before can in fact be modelled within the context of a &lt;em&gt;recurrent neural network&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With an RNN, we can store this type of vague knowledge directly into the neurons of the RNN, and we refer to this object as the &lt;em&gt;hidden state&lt;/em&gt; of the RNN.  This hidden state is just a vector of floating point numbers that keeps track of how active each neuron is.  What our model will write next will therefore depend on its hidden state.  This hidden state object will keep on getting updated after something is written, so it will be constantly changing.  We will demonstrate how this works in the next section.&lt;/p&gt;
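&lt;p&gt;To make the idea of a hidden state concrete, here is a drastically simplified toy recurrence in Javascript.  This is not the actual pre-trained model (which uses a far richer architecture with full weight matrices), just an illustration of how a hidden state gets updated after each stroke:&lt;/p&gt;

```javascript
// Toy recurrent update: the new hidden state is a squashed blend of the
// current input and the previous hidden state.  wIn and wRec are single
// shared weights here; a real RNN uses full weight matrices.
function rnnStep(input, hidden, wIn, wRec) {
  return hidden.map(function(h, i) {
    return Math.tanh(wIn * input[i] + wRec * h);
  });
}

var hidden = [0, 0, 0];                 // hidden state starts out empty
var strokes = [[1, 0, 0], [0, 1, 0]];   // toy input sequence
for (var t = 0; t !== strokes.length; t++) {
  hidden = rnnStep(strokes[t], hidden, 0.5, 0.9);
}
// hidden now carries a (vague) summary of everything written so far
```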

&lt;p&gt;The second assumption about the model is that the model will not be &lt;em&gt;absolutely certain&lt;/em&gt; about what it should write next.  In fact, the decision of what the model will write next is &lt;em&gt;random&lt;/em&gt;.  For example, when the model is writing the character &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt;, it may decide to either continue writing the character to make the bottom hook of the &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt; character larger, or it can decide to suddenly finish off the character and move the pen to another location.  Therefore, the output of our model will not be precisely what to write next, but actually a &lt;em&gt;probability distribution&lt;/em&gt; of what to write next.  We will need to sample from this probability distribution to decide what to actually write next.&lt;/p&gt;
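&lt;p&gt;As a rough illustration of what sampling from such a distribution looks like, here is a toy example over a simple discrete distribution.  The actual model outputs a mixture-density distribution rather than a discrete one, but the role of the temperature setting used later in this post is analogous: low temperature sharpens the distribution, high temperature flattens it.&lt;/p&gt;

```javascript
// Flatten or sharpen a discrete distribution with a temperature value,
// then renormalise so the probabilities sum to one.
function adjustTemperature(probs, temperature) {
  var scaled = probs.map(function(p) { return Math.pow(p, 1 / temperature); });
  var total = scaled.reduce(function(a, b) { return a + b; }, 0);
  return scaled.map(function(p) { return p / total; });
}

// Pick a random index, where index i is chosen with probability probs[i].
function sampleIndex(probs) {
  var r = Math.random();
  for (var i = 0; i !== probs.length; i++) {
    r -= probs[i];
    if (Math.sign(r) === -1) { return i; }  // r fell inside this slot
  }
  return probs.length - 1;  // guard against floating-point round-off
}

var pdf = adjustTemperature([0.5, 0.25, 0.25], 0.65);
var choice = sampleIndex(pdf);  // 0, 1, or 2 (most often 0)
```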

&lt;p&gt;These two assumptions can be summarised in the following diagram, which describes the process of using a Recurrent Neural Network model with a hidden state to generate a random sequence.&lt;/p&gt;

&lt;center&gt;
  &lt;br /&gt;
  &lt;div id=&quot;state_diagram&quot; class=&quot;wp-caption center&quot;&gt;
    &lt;a target=&quot;_blank&quot; href=&quot;/wp-content/uploads/sites/2/2015/12/state_diagram.svg&quot;&gt;&lt;img src=&quot;/wp-content/uploads/sites/2/2015/12/state_diagram.svg&quot; alt=&quot;state_diagram&quot; class=&quot;alignnone size-medium wp-image-983&quot; /&gt;&lt;/a&gt;
    &lt;p class=&quot;wp-caption-text&quot;&gt;
      Generative Sequence Model Framework&lt;br /&gt;
    &lt;/p&gt;
  &lt;/div&gt;
  &lt;br /&gt;
&lt;/center&gt;

&lt;p&gt;Don’t worry if you don’t fully understand this diagram.  In the next section, we will demonstrate what is going on line-by-line with Javascript.&lt;/p&gt;

&lt;h2 id=&quot;recurrent-neural-network-for-handwriting&quot;&gt;Recurrent Neural Network for Handwriting&lt;/h2&gt;

&lt;p&gt;We have pre-trained a recurrent neural network &lt;a href=&quot;https://github.com/hardmaru/rnn-tutorial&quot;&gt;model&lt;/a&gt; to perform the handwriting task described in the previous section.  In this section, we will describe how to use this model in Javascript with &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt;.  Below is the entire &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; sketch for handwriting generation.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.65&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerWidth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerHeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;setup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;createCanvas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;frameRate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;get_pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;stroke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;strokeWeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;2.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We will explain how each line works.  First, we will need to define a few variables to keep track of where the pen actually is (&lt;code class=&quot;highlighter-rouge&quot;&gt;x, y&lt;/code&gt;).  Our model will work with small coordinate offsets (&lt;code class=&quot;highlighter-rouge&quot;&gt;dx, dy&lt;/code&gt;) to determine where the pen should go next, and &lt;code class=&quot;highlighter-rouge&quot;&gt;(x, y)&lt;/code&gt; will be the accumulation of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy)&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// absolute coordinates of where the pen is&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// offsets of the pen strokes, in pixels&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In addition, our pen will not always be touching the paper.  We would need a variable, called &lt;code class=&quot;highlighter-rouge&quot;&gt;pen&lt;/code&gt;, to model this.  If &lt;code class=&quot;highlighter-rouge&quot;&gt;pen&lt;/code&gt; is zero, then our pen is touching the paper at the current time step.  We also need to keep track of the &lt;code class=&quot;highlighter-rouge&quot;&gt;pen&lt;/code&gt; variable at the previous time step, and store this into &lt;code class=&quot;highlighter-rouge&quot;&gt;prev_pen&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// keep track of whether pen is touching paper. 0 or 1.&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// pen at the previous timestep&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If we have a list of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; variables generated by our model at every time step, it will be enough for us to use this data to draw out what the model has generated on the screen.  At the beginning, all of these variables (&lt;code class=&quot;highlighter-rouge&quot;&gt;dx, dy, x, y, pen, prev_pen&lt;/code&gt;) will be initialised to zero.&lt;/p&gt;
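&lt;p&gt;To see why this list is enough to redraw everything, here is a small helper (an illustrative sketch, not part of the model code) that converts a list of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; triples into absolute line segments, using the same rule as the p5.js sketch above: a segment is only drawn when the pen was touching the paper at the previous time step.&lt;/p&gt;

```javascript
// Convert a list of [dx, dy, pen] triples into absolute line segments.
// pen is 0 when the pen touches the paper and 1 when it is lifted.
function strokesToSegments(strokes, startX, startY) {
  var x = startX, y = startY;
  var prevPen = 0;           // pen starts on the paper
  var segments = [];
  for (var i = 0; i !== strokes.length; i++) {
    var dx = strokes[i][0], dy = strokes[i][1], pen = strokes[i][2];
    if (prevPen === 0) {
      // only draw when the pen was down at the previous time step
      segments.push([x, y, x + dx, y + dy]);
    }
    x += dx;                 // accumulate offsets into absolute coordinates
    y += dy;
    prevPen = pen;
  }
  return segments;
}

// The second triple lifts the pen, so the third move leaves no segment.
var segs = strokesToSegments([[10, 0, 0], [0, 10, 1], [10, 0, 0]], 0, 0);
```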

&lt;p&gt;We will also define some variable objects that will be used by our RNN model:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// store the hidden state of the rnn&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// store all the parameters of a mixture-density distribution&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// controls the amount of uncertainty of the model&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// the higher the temperature, the more uncertainty.&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.65&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// a non-negative number.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As described in the previous section, the &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt; variable will represent the &lt;em&gt;hidden state&lt;/em&gt; of the RNN.  This variable will hold all the vague ideas about what the RNN &lt;em&gt;thought&lt;/em&gt; it has written in the past.  To update &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt;, we will use the &lt;code class=&quot;highlighter-rouge&quot;&gt;update&lt;/code&gt; function in the model later on in the code.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The object &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt; will be used to generate the probability distribution of what the model will write next.  That probability distribution will be represented as the object called &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt;.  To obtain the &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; object from &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt;, we will use the &lt;code class=&quot;highlighter-rouge&quot;&gt;get_pdf&lt;/code&gt; function later, like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;get_pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;An additional variable called &lt;code class=&quot;highlighter-rouge&quot;&gt;temperature&lt;/code&gt; allows us to control how confident or how uncertain we want to make the model.  Combined with the &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; object, we can use the &lt;code class=&quot;highlighter-rouge&quot;&gt;sample&lt;/code&gt; function in the model to sample the next set of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; values from our probability distribution.  We will use the following function later on:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The only other variables we need control the colour of the handwriting, and keep track of the browser window’s dimensions:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// stores the browser's dimensions&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerWidth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerHeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// colour for the handwriting&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we are ready to initialise all these variables we just declared for the actual handwriting generation.  We will create a function called &lt;code class=&quot;highlighter-rouge&quot;&gt;restart&lt;/code&gt; to initialise these variables since we will be reinitialising them many times later.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// set x to be 50 pixels from the left of the canvas&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// set y somewhere in middle of the canvas&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// initialize pen's states to zero.&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// note: we draw lines based off previous pen's state&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// randomise the rnn's initial hidden states&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// randomise colour of line by choosing RGB values&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;After creating the &lt;code class=&quot;highlighter-rouge&quot;&gt;restart&lt;/code&gt; function, we can define the usual &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;setup&lt;/code&gt; function to initialise the sketch.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;setup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// initialize variables for this demo&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;createCanvas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;frameRate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 60 frames per second&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// clear the background to be blank white colour&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Our handwriting generation will take place in the &lt;code class=&quot;highlighter-rouge&quot;&gt;draw&lt;/code&gt; function of the &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; framework.  This function is called 60 times per second.  Each time this function is called, the RNN will draw something on the screen.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// using the previous pen states, and hidden state&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// to get next hidden state &lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// get the parameters of the probability distribution&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// from the hidden state&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;get_pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// sample the next pen's states&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// using our probability distribution and temperature&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// only draw on the paper if pen is touching the paper&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// set colour of the line&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;stroke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// set width of the line to 2 pixels&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;strokeWeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;2.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// draw line connecting prev point to current point.&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// update the absolute coordinates from the offsets&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// update the previous pen's state&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// to the current one we just sampled&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// if the rnn starts drawing close to the right side&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// of the screen, restart our demo&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// reset screen&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;At each frame, the &lt;code class=&quot;highlighter-rouge&quot;&gt;draw&lt;/code&gt; function will update the hidden state of the model based on what it has previously drawn on the screen.  From this hidden state, the model will generate a probability distribution of what will be generated next.  Based on this probability distribution, along with the &lt;code class=&quot;highlighter-rouge&quot;&gt;temperature&lt;/code&gt; parameter, we will randomly sample what action it will take in the form of a new set of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; variables.  Based on this new set of variables, it will draw a line on the screen if the pen was previously touching the paper pad, and update the global location of the pen.  Once the global location of the pen gets close to the right side of the screen, it will reset the sketch and start again.&lt;/p&gt;
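&lt;p&gt;To see the bookkeeping in isolation, here is a small illustrative function (the name and setup are our own, not part of the demo code) that replays a recorded list of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; steps into absolute line segments, mirroring the logic of the &lt;code class=&quot;highlighter-rouge&quot;&gt;draw&lt;/code&gt; function without any p5.js calls:&lt;/p&gt;

```javascript
// Illustrative replay of recorded (dx, dy, pen) steps into absolute
// line segments. This mirrors the draw loop's bookkeeping, but the
// function name and example data are ours, not part of the demo code.
function strokes_to_lines(steps, start_x, start_y) {
  var x = start_x, y = start_y, prev_pen = 0;
  var lines = [];
  steps.forEach(function(step) {
    var dx = step[0], dy = step[1], pen = step[2];
    if (prev_pen === 0) {
      // the pen was touching the paper, so this offset leaves a visible line
      lines.push([x, y, x + dx, y + dy]);
    }
    // update the absolute coordinates and the previous pen state
    x += dx;
    y += dy;
    prev_pen = pen;
  });
  return lines;
}

// two strokes, then a pen lift, then a move that leaves no mark
var segments = strokes_to_lines([[5, 0, 0], [5, 2, 1], [10, -2, 0]], 50, 100);
```

&lt;p&gt;Note that a segment is recorded only when the &lt;em&gt;previous&lt;/em&gt; pen state was touching the paper, which is exactly the &lt;code class=&quot;highlighter-rouge&quot;&gt;prev_pen == 0&lt;/code&gt; check in the sketch above.&lt;/p&gt;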

&lt;p&gt;Putting all of this together, we get the following handwriting generation sketch.&lt;/p&gt;

&lt;div id=&quot;sketch02&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;So there you have it: handwriting generation in your web browser, with a few lines of Javascript using &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;sampling-from-a-probability-distribution-with-varying-temperature&quot;&gt;Sampling from a Probability Distribution with Varying Temperature&lt;/h3&gt;

&lt;p&gt;The variable &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; is supposed to store the probability distribution of the next pen stroke at each time step.  Under the hood, the object &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; actually just contains the parameters of a complicated probability distribution (i.e. the mixture weights, means and standard deviations of a collection of Normal distributions).  We have chosen to model the probability distribution of &lt;code class=&quot;highlighter-rouge&quot;&gt;dx&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;dy&lt;/code&gt; as a &lt;em&gt;Mixture Density Distribution&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But what exactly is a mixture density distribution?  Well, statisticians (&lt;em&gt;data scientists&lt;/em&gt;) like to model probability distributions with well known, mathematically tractable distributions such as the Normal Distribution, and they try to determine the parameters of the distribution (such as the mean and standard deviation for a Normal Distribution) to best fit the data.  However, when dealing with something complicated, like the strokes of handwriting data, we find that a simple Normal Distribution is not good enough to model the data.  Intuitively, handwriting strokes either stay close to the previous location, or jump to another location when a word or character is finished.&lt;/p&gt;

&lt;p&gt;A straightforward way to deal with this problem is to model the probability distribution as a weighted sum of many Normal distributions.  In our case, we model the handwriting strokes as a mixture of 20 Normal distributions, which does a reasonable job of modelling the actual handwriting data.  More technical details are available in this earlier &lt;a href=&quot;https://blog.otoro.net/2015/12/12/handwriting-generation-demo-in-tensorflow/&quot;&gt;post&lt;/a&gt;.&lt;/p&gt;
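&lt;p&gt;To make the idea concrete, here is a minimal sketch of a one-dimensional mixture density.  The helper names here are illustrative, and are not part of the actual &lt;code class=&quot;highlighter-rouge&quot;&gt;Model&lt;/code&gt; object:&lt;/p&gt;

```javascript
// Illustrative sketch of a mixture density distribution, in one dimension.
// These helper names are our own and not part of the actual Model object.
function normal_pdf(x, mu, sigma) {
  // density of a Normal distribution with mean mu and std deviation sigma
  var z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

function mixture_pdf(x, weights, mu, sigma) {
  // weighted sum of component densities; weights are non-negative, sum to 1
  var density = 0;
  weights.forEach(function(w, i) {
    density += w * normal_pdf(x, mu[i], sigma[i]);
  });
  return density;
}

// a tight component near zero (small strokes) plus a broad component
// (occasional large jumps between words or characters)
var p = mixture_pdf(0, [0.8, 0.2], [0, 5], [0.5, 3]);
```

&lt;p&gt;A single Normal distribution would have to compromise between these two behaviours, while the mixture can give most of its weight to small strokes and still allow the occasional large jump.&lt;/p&gt;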

&lt;p&gt;When we take this probability distribution, and sample from this distribution to get the set of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; values to determine what to draw next, we use the &lt;code class=&quot;highlighter-rouge&quot;&gt;temperature&lt;/code&gt; parameter to control the level of uncertainty of the model.  If the temperature parameter is very high, then we are more likely to obtain samples in less probable regions of the probability distribution.  If the temperature parameter is very low, or close to zero, then we will only obtain samples from the most probable parts of the distribution.&lt;/p&gt;
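&lt;p&gt;One common scheme for applying a temperature (assumed here for illustration; it is not necessarily the exact &lt;code class=&quot;highlighter-rouge&quot;&gt;sample&lt;/code&gt; implementation) is to sharpen or flatten the mixture weights, and to scale each component’s standard deviation:&lt;/p&gt;

```javascript
// Illustrative sketch of adjusting a mixture with a temperature parameter.
// This is one common scheme, not necessarily the actual sample() internals.
function adjust_temperature(weights, sigma, temperature) {
  // rescale the log-weights by the temperature and re-normalise (a softmax);
  // low temperature sharpens the weights, high temperature flattens them
  var logits = weights.map(function(w) { return Math.log(w) / temperature; });
  var max_logit = Math.max.apply(null, logits);
  var exps = logits.map(function(l) { return Math.exp(l - max_logit); });
  var total = exps.reduce(function(a, b) { return a + b; }, 0);
  var new_weights = exps.map(function(e) { return e / total; });
  // widen or narrow each component so samples spread out with temperature
  var new_sigma = sigma.map(function(s) { return s * Math.sqrt(temperature); });
  return [new_weights, new_sigma];
}
```

&lt;p&gt;At a temperature of 1 the distribution is unchanged.  As the temperature approaches zero, the most probable component dominates and its standard deviation shrinks, so sampling becomes nearly deterministic.&lt;/p&gt;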

&lt;p&gt;In the sketch below, you can visualise how the probability distribution changes as the temperature parameter is varied.  You can control the temperature parameter by dragging the top orange bar.&lt;/p&gt;

&lt;div id=&quot;sketch04&quot;&gt;&lt;/div&gt;
&lt;center&gt;&lt;i&gt;Visualise a Mixture Density Distribution by adjusting the Temperature.&lt;/i&gt;&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;For simplicity, the above demo simulates a mixture of twenty, one-dimensional normal distributions with a temperature parameter.  In the handwriting model, the probability distribution is a mixture of twenty, two-dimensional normal distributions.  In the next sketch, you can modify the temperature of the handwriting model while it is writing something, to see how the handwriting changes with varying temperatures.&lt;/p&gt;

&lt;div id=&quot;sketch03&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;When the temperature is kept low, the handwriting model becomes close to deterministic, so the handwriting is generally neater and more realistic.  Increasing the temperature increases the likelihood of sampling from less probable regions of the probability distribution, so the handwriting samples tend to be more funky and uncertain.&lt;/p&gt;

&lt;h2 id=&quot;extending-the-handwriting-demo&quot;&gt;Extending the Handwriting Demo&lt;/h2&gt;

&lt;p&gt;One of the more interesting aspects of combining machine learning with design is exploring the interaction between human and machine.  The typical machine learning framework + python stack makes it difficult to deploy truly interactive web applications, as they often require dedicated web services to be written on the server side to process user interaction on the client side.  The nice thing about Javascript frameworks such as &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; is that interactive programming can be done with ease, and the result can be deployed in a web browser without much effort.&lt;/p&gt;

&lt;div id=&quot;sketch05&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;A possible &lt;a target=&quot;_blank&quot; href=&quot;https://otoro.net/ml/rnn-tutorial/multi.html&quot;&gt;interactive extension&lt;/a&gt; we can build from the basic handwriting demo is to have the user interactively write some handwriting onto the screen, and when the user is idle, have the model continuously predict the rest of the handwriting sample.  Another &lt;a target=&quot;_blank&quot; href=&quot;https://otoro.net/ml/rnn-tutorial/predict.html&quot;&gt;extension&lt;/a&gt; we can build, similar to the one in the &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; post, is to have the model sample multiple possible paths that follow the handwriting path created by the user.&lt;/p&gt;

&lt;div id=&quot;sketch06&quot;&gt;&lt;/div&gt;

&lt;p&gt;There are countless other possibilities one can explore with this model.  It would also be interesting to combine this model with more advanced frameworks such as &lt;a href=&quot;https://paperjs.org/&quot;&gt;paper.js&lt;/a&gt; or &lt;a href=&quot;https://bl.ocks.org/&quot;&gt;d3.js&lt;/a&gt; to generate better looking strokes.&lt;/p&gt;

&lt;h2 id=&quot;use-this-code&quot;&gt;Use this code!&lt;/h2&gt;

&lt;p&gt;If you are an artist or designer interested in machine learning, you can fork the &lt;a href=&quot;https://github.com/hardmaru/rnn-tutorial&quot;&gt;github repository&lt;/a&gt; containing the code used for this post, and use it to your liking.&lt;/p&gt;

&lt;p&gt;This post only scratches the surface of recurrent neural networks.  If you want to be more involved in the whole machine learning development process and train your own models, there are excellent &lt;a href=&quot;https://ml4a.github.io/&quot;&gt;resources&lt;/a&gt; to learn how to build models with &lt;a href=&quot;https://github.com/jtoy/awesome-tensorflow&quot;&gt;TensorFlow&lt;/a&gt; or &lt;a href=&quot;https://github.com/fchollet/keras-resources&quot;&gt;keras&lt;/a&gt;.  If you use &lt;a href=&quot;https://keras.io&quot;&gt;keras&lt;/a&gt; to build and train your models, there is even a tool called &lt;a href=&quot;https://github.com/transcranial/keras-js&quot;&gt;keras.js&lt;/a&gt; that allows you to export pre-trained models for use in the web browser, so you can build model interfaces like the Javascript handwriting model used in this post.  I haven’t personally used &lt;a href=&quot;https://github.com/transcranial/keras-js&quot;&gt;keras.js&lt;/a&gt;, since I found it fun to just write the handwriting model from scratch in Javascript.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This model has already been ported to &lt;a href=&quot;https://bl.ocks.org/dribnet/cd6ee08b7658e5c744307b44b438221f&quot;&gt;bl.ocks&lt;/a&gt;, and extended by a few people to do some &lt;a href=&quot;https://bl.ocks.org/dribnet/8284e82ecefefeb391298356d3ab6732&quot;&gt;very&lt;/a&gt;, &lt;a href=&quot;https://bl.ocks.org/dribnet/f27c6167fcf4157cd0da0d9d5d016aa7&quot;&gt;interesting&lt;/a&gt;, &lt;a href=&quot;https://naoyashiga.github.io/my-dying-message/&quot;&gt;things&lt;/a&gt;.&lt;/p&gt;

&lt;script src=&quot;/js/p5/handwriting/numjs.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/js/p5/handwriting/weights.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/js/p5/handwriting/model.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/js/p5/handwriting/handwriting.js&quot;&gt;&lt;/script&gt;

</description>
        <pubDate>Sun, 01 Jan 2017 00:00:00 -0600</pubDate>
      </item>
    
  </channel>
</rss>
