<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>大トロ</title>
    <atom:link href="https://blog.otoro.net/feed.xml" rel="self" type="application/rss+xml"/>
    <link>https://blog.otoro.net/</link>
    <description>ml ・ design</description>
    <pubDate>Mon, 03 Oct 2022 22:32:12 -0500</pubDate>
    
      <item>
        <title>Collective Intelligence for Deep Learning: A Survey of Recent Developments</title>
        <link>https://blog.otoro.net/2022/10/01/collectiveintelligence/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2022/10/01/collectiveintelligence/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/magent_large.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;br /&gt;
&lt;i&gt;We survey ideas from complex systems such as swarm intelligence, self-organization, and emergent behavior that are gaining traction in ML. (Figure: Emergence of encirclement tactics in &lt;a href=&quot;https://arxiv.org/abs/1712.00600&quot;&gt;MAgent&lt;/a&gt;.)&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Unless you’ve been living under a rock, you’ve noticed that artificial neural networks are now used everywhere. They’re impacting our everyday lives, from predictive tasks such as recommendation, facial recognition and object classification, to generative tasks such as machine translation and image, sound, and video generation. But behind all of these advances, the impressive feats of deep learning required a substantial amount of sophisticated engineering effort.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/alexnet.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;AlexNet&lt;/strong&gt;. Neural network architecture of &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3065386&quot;&gt;AlexNet&lt;/a&gt; (Krizhevsky et al. 2012), the winner of the ImageNet competition in 2012.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Even if we look at the early AlexNet from 2012, which made deep learning famous when it won that year’s ImageNet competition, we can see the careful engineering decisions involved in its design. Modern networks are often even more sophisticated, requiring a pipeline that spans network architecture and careful training schemes. A lot of sweat and labor went into producing these amazing results.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/bridge_vs_ant.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Engineered vs Emerged Bridges&lt;/strong&gt;. Left: The Confederation Bridge in Canada. Right: Army ants forming a bridge.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I believe the way we are currently doing deep learning is like engineering: we build neural network systems the same way we build bridges and buildings. But in natural systems, where the concept of &lt;em&gt;emergence&lt;/em&gt; plays a big role, we see complex designs that emerge through self-organization, and such designs are usually sensitive and responsive to changes in the world around them. Natural systems &lt;em&gt;adapt&lt;/em&gt;, and become &lt;em&gt;a part&lt;/em&gt; of their environment.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;i&gt;“Bridges and buildings are all designed to be indifferent to their environment, to withstand fluctuations, not to adapt to them. The best bridge is one that just stands there, whatever the weather.”&lt;/i&gt;&lt;/b&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
 &amp;mdash; Andrew Pickering, &lt;a href=&quot;https://www.goodreads.com/book/show/7636063-the-cybernetic-brain&quot;&gt;The Cybernetic Brain&lt;/a&gt;.
&lt;p&gt;
&lt;/p&gt;
&lt;/center&gt;
&lt;hr /&gt;

&lt;p&gt;In the last few years, I have noticed many works popping up in deep learning research that use ideas from collective intelligence, in particular from the area of emergent complex systems. Recently, &lt;a href=&quot;https://twitter.com/yujin_tang&quot;&gt;Yujin Tang&lt;/a&gt; and I put together a survey paper on this topic called &lt;a href=&quot;https://doi.org/10.1177/26339137221114874&quot;&gt;Collective intelligence for deep learning: A survey of recent developments&lt;/a&gt;, and in this post, I will summarize the key themes in our paper.&lt;/p&gt;

&lt;h2 id=&quot;historical-background&quot;&gt;Historical Background&lt;/h2&gt;

&lt;p&gt;The course deep learning took may be just an accident of history; it didn’t have to be this way. In fact, in the earlier days of neural network development, in the 1980s, many groups, including one led by the legendary electrical engineer &lt;a href=&quot;https://en.wikipedia.org/wiki/Leon_O._Chua&quot;&gt;Leon Chua&lt;/a&gt;, worked on neural networks much closer to natural adaptive systems. They developed &lt;a href=&quot;https://en.wikipedia.org/wiki/Cellular_neural_network&quot;&gt;Cellular Neural Networks&lt;/a&gt;: artificial neural network circuits built from grids of artificial neurons.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/cenn01.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/cenn02.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Cellular Neural Networks&lt;/strong&gt;. Each neuron in a cellular neural network receives signals from its neighbors, performs a weighted sum, applies a non-linear activation function, much as we do today, and sends the result off to its neighbors. The difference between these networks and today’s networks is that they were built using analog circuits: they computed only approximately, but at the time ran much faster than digital circuits. Also, the wiring of each cell (the ‘parameters’ of each cell) is exactly the same. (Source: &lt;a href=&quot;https://youtu.be/TZrXncVE9e8&quot;&gt;The Chua Lectures&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What is remarkable is that even in the late 1980s, these networks were shown to produce amazing results such as object extraction. These analog networks work in &lt;em&gt;nano-seconds&lt;/em&gt;, speeds we were only able to match decades later with digital circuits. They can be programmed to do non-trivial things, like selecting all objects in a pixel image that are &lt;em&gt;pointing up&lt;/em&gt; and erasing all the others, tasks we only managed to reproduce decades later with deep learning:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;img src=&quot;/assets/20221001/cenn_object_detection.jpg&quot; width=&quot;70%&quot; /&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Using Cellular Neural Networks to detect all objects that are pointing upwards&lt;/strong&gt;. Left: Input pixel image. Right: Output pixel image.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the past few years, we have noticed many works in deep learning research exploring ideas similar to these cellular neural networks, drawn from emergent complex systems, which prompted us to write a survey. The problem is that complex systems is a huge field, spanning topics as far afield as the behavior of actual honeybee swarms and ant colonies, so we will limit our discussion to a few areas focused on machine learning:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Image Processing and Generative Models&lt;/li&gt;
  &lt;li&gt;Deep Reinforcement Learning&lt;/li&gt;
  &lt;li&gt;Multi-agent Learning&lt;/li&gt;
  &lt;li&gt;Meta-learning (&lt;em&gt;“Learning-to-learn”&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;image-generation&quot;&gt;Image Generation&lt;/h2&gt;

&lt;p&gt;We’ll start by discussing the idea of &lt;em&gt;image generation&lt;/em&gt; using collective intelligence. One cool example of this is a collective &lt;em&gt;human&lt;/em&gt; intelligence: the Reddit &lt;a href=&quot;https://old.reddit.com/r/place/&quot;&gt;r/Place&lt;/a&gt; experiment. In this community experiment, Reddit set up a 1000x1000 pixel canvas for Reddit users to collectively create a megapixel image. But the interesting thing is the constraint Reddit imposed: each user is only allowed to paint a &lt;em&gt;single&lt;/em&gt; pixel every 5 minutes:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/reddit_rplace.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Reddit &lt;a href=&quot;https://old.reddit.com/r/place/&quot;&gt;r/Place&lt;/a&gt; experiment&lt;/strong&gt;: Watch a few days of activity happen in minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This experiment lasted for a week, allowing millions of Reddit users to draw whatever they wanted. Because of the time constraint imposed on each user, drawing something meaningful required collaboration, and users ultimately coordinated strategies on discussion forums to &lt;em&gt;defend&lt;/em&gt; their designs, &lt;em&gt;attack&lt;/em&gt; other designs, and even form alliances. It is truly an example of the creativity of collective human intelligence.&lt;/p&gt;

&lt;p&gt;Early algorithms also computed designs on a pixel grid in a collective way. An example is the &lt;em&gt;Cellular Automaton&lt;/em&gt;, exemplified by Conway’s Game of Life, where the state of each pixel on a grid is computed by a function of the states of its neighbors at the previous time step. From these simple rules, complex patterns can emerge:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;img src=&quot;/assets/20221001/conway.gif&quot; width=&quot;50%&quot; /&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Conway’s Game of Life&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;
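&lt;p&gt;The entire rule fits in a few lines. Here is a minimal sketch in Python using NumPy (with wrap-around borders for simplicity), demonstrated on the classic glider pattern:&lt;/p&gt;

```python
import numpy as np

def life_step(grid):
    """One step of Conway's Game of Life on a 2D binary array."""
    # Count live neighbors by summing the 8 shifted copies of the grid
    # (np.roll gives toroidal wrap-around borders).
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbors,
    # or if it is alive now and has exactly 2 live neighbors.
    return np.logical_or(
        neighbors == 3,
        np.logical_and(grid == 1, neighbors == 2),
    ).astype(int)

# The classic "glider": after 4 steps it reappears shifted one cell
# diagonally, a pattern that emerges purely from the local rule.
glider = np.zeros((8, 8), dtype=int)
for r, c in [(1, 2), (2, 3), (3, 1), (3, 2), (3, 3)]:
    glider[r, c] = 1
g = glider
for _ in range(4):
    g = life_step(g)
```

&lt;p&gt;After four steps, the five live cells form the original glider translated one cell down and to the right.&lt;/p&gt;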

&lt;p&gt;A recent work, Neural Cellular Automata (&lt;a href=&quot;https://distill.pub/2020/growing-ca/&quot;&gt;Mordvintsev et al., 2020&lt;/a&gt;), extends the concept of CAs by replacing the simple rules with a neural network, so in a sense it is really similar to the Cellular Neural Networks from the 1980s discussed earlier. In this work, they apply Neural CAs to image generation: as in the Reddit r/Place example, at each time step a randomly chosen subset of pixels is updated, based on the output of a single neural network function whose inputs are only the values of each pixel’s immediate neighbors.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;img src=&quot;/assets/20221001/neuralca.jpg&quot; width=&quot;100%&quot; /&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Neural Cellular Automata&lt;/strong&gt; for Image Generation.&lt;/em&gt;&lt;/p&gt;
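&lt;p&gt;The per-cell update can be sketched in a few lines. Below is a minimal, untrained toy version (the single linear layer, scalar cell state, and update probability are illustrative stand-ins; the actual model uses a small deep network and multi-channel cell states):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One rule network shared by every cell: 3x3 neighborhood in, state delta out.
# A single linear layer here, purely for illustration.
W = rng.normal(0, 0.1, size=(9, 1))

def nca_step(grid, update_prob=0.5):
    """One stochastic Neural CA step over a 2D grid of scalar cell states."""
    h, w = grid.shape
    padded = np.pad(grid, 1)          # zero states just outside the border
    new_grid = grid.copy()
    for i in range(h):
        for j in range(w):
            # Each cell sees only its immediate 3x3 neighborhood...
            patch = padded[i:i + 3, j:j + 3].reshape(9)
            # ...and only a random subset of cells fires on any given step.
            if update_prob > rng.random():
                new_grid[i, j] = grid[i, j] + np.tanh(patch @ W)[0]
    return new_grid

grid = np.zeros((8, 8))
grid[4, 4] = 1.0                      # a single "seed" cell
for _ in range(10):
    grid = nca_step(grid)
```

&lt;p&gt;Starting from a single seed cell, activity spreads outward through purely local interactions; training such a rule (with gradients through the unrolled steps) is what makes a target image emerge.&lt;/p&gt;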

&lt;p&gt;They show that a Neural CA can be trained to output any particular given design from a sparse stochastic sampling rule and an almost empty initial canvas. Here are three Neural CAs producing three designs. What is remarkable about this method is that when part of the image gets corrupted, the algorithm automatically regenerates the corrupted part in its own way.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/neural_ca.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Neural Cellular Automata&lt;/strong&gt; regenerating corrupted images.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Neural CAs can also perform prediction tasks in a collective fashion. For example, they can be applied to classify &lt;a href=&quot;https://distill.pub/2020/selforg/mnist/&quot;&gt;MNIST digits&lt;/a&gt;, but here each cell must produce its own prediction based on its own pixel and the predictions of its immediate neighbors. Its prediction in turn influences its neighbors’ predictions and changes their opinions over time, like in a democratic society. Usually some consensus forms across the collection of pixels, but sometimes we see interesting effects: if the digit is written in an ambiguous way, different regions of the digit can settle into different steady-state predictions.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/neural_ca_mnist.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;&lt;a href=&quot;https://distill.pub/2020/selforg/mnist/&quot;&gt;Self-classifying MNIST Digits&lt;/a&gt;&lt;/strong&gt;. A Neural Cellular Automata trained to recognize MNIST digits, created by &lt;a href=&quot;https://distill.pub/2020/selforg/mnist/&quot;&gt;Randazzo et al. 2020&lt;/a&gt;, is also available as an interactive web demo. Each cell is only allowed to see the contents of a single pixel and communicate with its neighbors. Over time, a consensus will form as to which digit the cells most likely belong to, but interestingly, disagreements may arise depending on the location of the pixel where the prediction is made.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Neural CAs are not confined to generating pixels; they can also generate voxels and 3D shapes. Recent work even used Neural CAs to produce designs in Minecraft, whose building blocks are essentially voxels. They can produce things like buildings and trees, but most interestingly, since some components inside Minecraft are active rather than passive, they can also generate functional machines with behavior.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/neuralca_minecraft.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Neural CAs have also been applied to the regeneration of Minecraft entities. In this work by &lt;a href=&quot;https://arxiv.org/abs/2103.08737&quot;&gt;Sudhakaran et al., 2021&lt;/a&gt;, the authors’ formulation enabled the regeneration not only of Minecraft buildings and trees, but also of simple functional machines in the game, such as worm-like creatures that can even regenerate into two distinct creatures when cut in half.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here, they show that when one of these functional machines gets cut in half, each half regenerates itself morphogenetically, ending up as two functional machines.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/morphogenesis.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Morphogenesis&lt;/strong&gt;. Beyond regenerating static structures, the Neural CA system in Minecraft is able to regrow parts of simple functional machines, such as a virtual creature in the game. Here, a morphogenetic creature grows into two distinct creatures when cut in half.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;deep-reinforcement-learning&quot;&gt;Deep Reinforcement Learning&lt;/h2&gt;

&lt;p&gt;Another popular area within Deep Learning is to train neural networks with reinforcement learning for tasks like locomotion control. Here are a few examples of these &lt;em&gt;Mujoco Humanoid&lt;/em&gt; benchmark environments and their state-of-the-art solutions:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/humanoid_sota.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;State-of-the-art Mujoco Humanoids&lt;/strong&gt;. You may not like it, but this is what peak performance looks like.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What usually happens is that all of the input observations (in the case of the humanoid, 376 of them) are fed into a deep neural network, the “policy”, which outputs the 17 actions required to control the actuators of the humanoid so it moves forward. Typically, these policy networks tend to overfit the training environment, so you end up with solutions that only work for this exact body design and simulation environment.&lt;/p&gt;
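&lt;p&gt;As a point of reference, the standard monolithic setup looks something like this. The 376/17 sizes match the Mujoco Humanoid; the single hidden layer and random weights are an illustrative simplification of a trained policy:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One big policy network: every observation in, every action out.
W1 = rng.normal(0, 0.1, size=(376, 64))   # 376 observations to a hidden layer
W2 = rng.normal(0, 0.1, size=(64, 17))    # hidden layer to 17 actuator torques

def policy(obs):
    """Map the full observation vector to all 17 actions at once."""
    return np.tanh(np.tanh(obs @ W1) @ W2)

action = policy(rng.normal(size=376))
```

&lt;p&gt;Note how the weight shapes hard-code one particular body: change the number of sensors or actuators, and this policy is useless.&lt;/p&gt;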

&lt;p&gt;We’ve seen some interesting works recently that look at using a collective controller approach for these problems. In particular, in &lt;a href=&quot;https://arxiv.org/abs/2007.04976&quot;&gt;Huang et al., 2020&lt;/a&gt;, rather than having one policy network take all of the inputs and output all of the actions, here, they use a single shared policy for every actuator in the agent, effectively decomposing an agent into a collection of agents connected by limbs:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/rl_module.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Traditional RL methods train a specific policy for a particular robot with a fixed morphology. But recent work, like the one shown here by &lt;a href=&quot;https://arxiv.org/abs/2007.04976&quot;&gt;Huang et al. 2020&lt;/a&gt;, attempts to train a single modular neural network responsible for controlling a single part of a robot. The global policy of each robot thus results from the coordination of these identical modular neural networks, emerging from local interaction. This system can generalize across a variety of different skeletal structures, from hoppers to quadrupeds, and even to some unseen morphologies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These policies communicate bi-directionally with their neighbors, so over time a global policy can emerge from local interaction. Not only is this single policy trained on one agent design, it must work across dozens of designs in the training set, so every one of these agents is controlled by the same policy governing each actuator:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/rl_module.gif&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;One identical neural network&lt;/strong&gt; controlling every actuator must work across all of these designs.&lt;/em&gt;&lt;/p&gt;
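&lt;p&gt;A toy sketch of the idea follows. The sizes, the single forward messaging pass, and the random weights are all hypothetical simplifications; the actual method learns the weights with RL and runs several bi-directional message-passing rounds:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

OBS, MSG, HID = 4, 3, 16   # illustrative sizes, not the paper's

# One set of weights shared by EVERY actuator module.
W_in = rng.normal(0, 0.1, size=(OBS + 2 * MSG, HID))
W_act = rng.normal(0, 0.1, size=(HID, 1))    # one torque per joint
W_msg = rng.normal(0, 0.1, size=(HID, MSG))  # message sent to neighbors

def actuator_module(local_obs, msg_prev, msg_next):
    """The same function runs at every joint; only its inputs differ."""
    h = np.tanh(np.concatenate([local_obs, msg_prev, msg_next]) @ W_in)
    return np.tanh(h @ W_act)[0], np.tanh(h @ W_msg)

def collective_policy(observations):
    """Control a chain of joints of ANY length with one shared module."""
    n = len(observations)
    msgs = [np.zeros(MSG) for _ in range(n + 2)]  # zero messages at the ends
    actions = []
    for i, obs in enumerate(observations):
        act, msgs[i + 1] = actuator_module(obs, msgs[i], msgs[i + 2])
        actions.append(act)
    return np.array(actions)

# The very same weights control a 4-joint hopper or a 7-joint walker.
a4 = collective_policy([rng.normal(size=OBS) for _ in range(4)])
a7 = collective_policy([rng.normal(size=OBS) for _ in range(7)])
```

&lt;p&gt;Because nothing in the shared module depends on the total number of joints, the same weights produce an action vector of whatever length the morphology requires.&lt;/p&gt;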

&lt;p&gt;They show that this type of collective system has some zero-shot generalization capabilities and can also control agents with not only different design variations with different limb lengths and masses, but also novel designs not in the training set, and also deal with unseen challenges:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/rl_module.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Well, why rely on a fixed design? Another work, &lt;a href=&quot;https://arxiv.org/abs/1902.05546&quot;&gt;Pathak et al., 2019&lt;/a&gt;, looks at getting every limb to figure out a way to self-assemble and &lt;em&gt;learn&lt;/em&gt; a design to perform tasks like balancing and locomotion:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;
&lt;img src=&quot;/assets/20221001/self_assembling_limbs.jpg&quot; width=&quot;70%&quot; /&gt;
&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Self-assembling limbs&lt;/strong&gt;. Self-organization also enables systems in RL environments to self-configure their own designs for a given task. In &lt;a href=&quot;https://arxiv.org/abs/1902.05546&quot;&gt;Pathak et al., 2019&lt;/a&gt;, the authors explored such dynamic and modular agents and showed that they can generalize not only to unseen environments, but also to unseen morphologies composed of additional modules.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They show that this approach can generalize to cases with double or half the number of limbs the system was trained on, something simply not possible with traditional deep RL. And even in settings where a system trained with traditional deep RL does work, the self-assembling solutions consistently prove more robust to unseen challenges like wind or, in the case of locomotion, to new types of terrain such as hurdles and stairs:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/self_assembling_limbs.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This type of collective policy making can also be applied to image-based RL tasks. In a recent &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;paper&lt;/a&gt; that Yujin Tang and I presented at NeurIPS, we looked at feeding each patch of a video feed into identical sensory neuron units. These sensory neurons must figure out the context of their own input channels and then self-organize, using an attention mechanism for communication, to collectively output motor commands for the agent. This allows the agent to keep working even when the patches on the screen are all shuffled:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/car_racing.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/mt_fuji.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Sensory Substitution&lt;/strong&gt;. Using the properties of self-organization and attention, our paper, &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;Tang and Ha, 2021&lt;/a&gt;, investigated RL agents that treat their observations as an arbitrarily ordered, variable-length list of sensory inputs. The input in visual tasks such as CarRacing is partitioned into a 2D grid of small patches whose ordering is shuffled. Each sensory neuron in the system receives a stream from a particular patch of pixels and, through coordination, must complete the task at hand. The agent even works with new backgrounds it hasn’t seen during training (it has only seen the green grass background).&lt;/em&gt;&lt;/p&gt;
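&lt;p&gt;The key mechanism is attention over an unordered set: every patch passes through the same sensory-neuron weights, and a fixed query pools them, so the output cannot depend on patch order. A minimal sketch with toy sizes and random weights (all hypothetical, not the paper’s architecture):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH, KEY, OUT = 16, 8, 3   # toy sizes, not the paper's

W_k = rng.normal(0, 0.1, size=(PATCH, KEY))  # shared sensory-neuron weights
W_v = rng.normal(0, 0.1, size=(PATCH, OUT))
Q = rng.normal(0, 0.1, size=(1, KEY))        # a fixed query vector

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(patches):
    """Pool an arbitrarily ordered set of patches into a fixed-size output."""
    K = patches @ W_k                        # every patch through the SAME neuron
    V = patches @ W_v
    A = softmax(Q @ K.T / np.sqrt(KEY))      # attention weights over the set
    return (A @ V).ravel()                   # a weighted sum: order cannot matter

patches = rng.normal(size=(10, PATCH))
out_original = attend(patches)
out_shuffled = attend(patches[rng.permutation(10)])
```

&lt;p&gt;Shuffling the patches permutes the rows of the keys and values identically, so the attention-weighted sum, and hence the output, is unchanged.&lt;/p&gt;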

&lt;p&gt;The work is inspired by the idea of sensory substitution, where different parts of the brain can be retrained to process different sensory modalities, enabling us to adapt our senses to crucial information sources.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/sensory_substitution.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Neuroscientist Paul Bach-y-Rita (1934-2006)&lt;/strong&gt; is known as “the father of sensory substitution”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This method works on non-vision tasks too. When we apply this method to a locomotion task, like this ant agent, we can shuffle the ordering of the 28 inputs quite frequently, and our agent will quickly adjust to a dynamic observation space:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/attention_agent_ants.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/center&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Permutation invariant reinforcement learning agents adapting to sensory substitutions.&lt;/strong&gt; The ordering of the ant’s 28 observations are randomly shuffled every 200 time-steps. Unlike the standard policy, our policy is not affected by the suddenly permuted inputs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can get the agent to play a Puzzle Pong game where the patches are constantly reshuffled, and we show that the system can also work with partial information, like with only 70% of the patches, which are all shuffled:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/pong_reshuffle.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20221001/pong_occluded_reshuffle.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;h2 id=&quot;multi-agent-learning&quot;&gt;Multi-agent learning&lt;/h2&gt;

&lt;p&gt;The earlier reinforcement learning examples were mainly about decomposing a single agent into a smaller collection of agents. But what we do know from complex systems is that emergence often occurs at much larger scales than 10 or 20 agents. Perhaps we need a collection of thousands or more individual agents to interact meaningfully for complex “super organisms” to emerge.&lt;/p&gt;

&lt;p&gt;A few years back, a paper looked at taking advantage of hardware accelerators like GPUs to significantly scale up multi-agent reinforcement learning. In this work, called MAgent (&lt;a href=&quot;https://arxiv.org/abs/1712.00600&quot;&gt;Zheng et al., 2018&lt;/a&gt;), they proposed a framework that gets up to a million agents, albeit simple ones, to engage in various grid-world multi-agent environments; furthermore, one population of agents can be pitted against another in a collective self-play manner.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/magent.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;MAgent (&lt;a href=&quot;https://arxiv.org/abs/1712.00600&quot;&gt;Zheng et al., 2018&lt;/a&gt;)&lt;/strong&gt; is a set of environments where large numbers of pixel agents in a gridworld interact in battles or other competitive scenarios. Unlike most platforms that focus on RL research with a single agent or only few agents, their aim is to support RL research that scales up to millions of agents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hardware revolution brought about by deep learning lets us train truly large-scale collective behavior. In some of these experiments, they observe predator-prey loops and encirclement tactics emerging from truly large-scale multi-agent reinforcement learning. Such macro-level collective intelligence will probably not emerge from traditional small-scale multi-agent environments:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/magent_large.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I would like to note that this work is from 2018, and hardware acceleration has only progressed further since then. A recent demo from NVIDIA last year showcased a physics engine that can now handle thousands of agents acting in a realistic physics simulation, unlike the simple gridworld environments. I believe that in the future, we could see really interesting studies of emergent behavior using these newer technologies.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 70%;&quot;&gt;&lt;source src=&quot;/assets/20221001/nvidia.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Recent advances in GPU hardware enable realistic 3D simulation of thousands of robot models, such as the one shown in this figure by &lt;a href=&quot;https://arxiv.org/abs/2109.11978&quot;&gt;Rudin et al. 2021&lt;/a&gt;. Such advances open the door to large-scale 3D simulation of artificial agents that interact with each other and collectively develop intelligent behavior.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;meta-learning&quot;&gt;Meta-Learning&lt;/h2&gt;

&lt;p&gt;These increases in compute capabilities won’t stop at simulation. I’ll end with a discussion on how collective behavior is being applied to meta-learning. We can think of an artificial neural network as a collection of neurons and synapses, each of which can be modeled as an individual agent, and collectively, these agents all interact inside a system where the ability to &lt;em&gt;learn&lt;/em&gt; is an emergent property.&lt;/p&gt;

&lt;p&gt;Currently, our concept of an artificial neural network is simply weight matrices between nodes with a non-linear activation function. But with extra compute, we can also explore really interesting directions, simulating generalized versions of neural networks where perhaps every “neuron” is implemented as an identical recurrent neural network (which can in principle compute anything). I remember &lt;a href=&quot;https://twitter.com/hardmaru/status/1109986348545368064&quot;&gt;several neuroscience papers&lt;/a&gt; exploring this theme; see neuroscientist Mark Humphries’s excellent &lt;a href=&quot;https://medium.com/the-spike/your-cortex-contains-17-billion-computers-9034e42d34f2&quot;&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/neuron_as_neural_network.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Each “Neuron” is an Artificial Neural Network&lt;/strong&gt;. “If we think the brain is a computer, because it is like a neural network, then now we must admit that individual neurons are computers too. All 17 billion of them in your cortex; perhaps all 86 billion in your brain.” — &lt;a href=&quot;https://medium.com/the-spike/your-cortex-contains-17-billion-computers-9034e42d34f2&quot;&gt;Mark Humphries&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Rather than the neuron, though, we have recently seen some ambitious works modeling the &lt;em&gt;synapse&lt;/em&gt; as a recurrent neural network. When a standard neural network is trained, a forward pass propagates the inputs of the network to the output, and the backpropagation algorithm then propagates the error signals back from the output layer to the input layer, using gradients to adjust the weights. In principle, an RNN synapse can also learn something like the backpropagation rule, or perhaps something even better.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/vsmetaml.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Each Synapse is a Recurrent Neural Network&lt;/strong&gt;. Recent work by &lt;a href=&quot;https://arxiv.org/abs/2104.04657&quot;&gt;Sandler et al., 2021&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2012.14905&quot;&gt;Kirsch and Schmidhuber, 2020&lt;/a&gt; attempt to generalize the accepted notion of artificial neural networks, where each neuron can hold multiple states rather than a scalar value, and each synapse function bi-directionally to facilitate both learning and inference. In this figure, (&lt;a href=&quot;https://arxiv.org/abs/2109.10781&quot;&gt;Kirsch et al. 2021&lt;/a&gt;) use an identical recurrent neural network (RNN) (with different internal hidden states) to model each synapse, and show that the network can be trained by simply running the RNNs forward, without using backpropagation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So rather than relying on hand-coded forward and back propagation, we can model each synapse of a neural network with a recurrent neural network, which is a universal computer, and let it &lt;em&gt;learn&lt;/em&gt; how to best forward and back propagate signals: learning how to learn. The “hidden states” of each RNN would essentially define what the “weights” are, in a highly plastic way.&lt;/p&gt;
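To make this idea concrete, here is a toy numerical sketch (my own illustration, not the architecture of Sandler et al. or Kirsch et al.): every synapse runs the same tiny RNN cell, but each keeps its own hidden state, and the effective connection weight is read out from that state, so updating the hidden states changes how the network computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, d = 3, 2, 4              # 3 inputs, 2 outputs, hidden size 4

# Shared RNN-cell parameters (identical for every synapse).
W_h = rng.normal(0, 0.1, (d, d))
W_x = rng.normal(0, 0.1, (d,))
w_read = rng.normal(0, 0.1, (d,))

# Per-synapse hidden states: one state vector per (input, output) pair.
H = rng.normal(0, 0.1, (n_in, n_out, d))

def step(x, H):
    """One forward pass; each synapse also updates its own hidden state."""
    weights = H @ w_read              # read a scalar weight out of each state
    y = x @ weights                   # ordinary weighted sum per output unit
    # Every synapse runs the same cell on (its state, its presynaptic input).
    H_new = np.tanh(H @ W_h + x[:, None, None] * W_x)
    return y, H_new

x = np.array([1.0, -0.5, 0.2])
y, H2 = step(x, H)                    # H2 now encodes updated "weights"
```

Here the network's behavior is entirely determined by the hidden states, which change on every forward pass, rather than by a fixed weight matrix.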

&lt;p&gt;Recent works, &lt;a href=&quot;https://arxiv.org/abs/2104.04657&quot;&gt;Sandler et al., 2021&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2012.14905&quot;&gt;Kirsch and Schmidhuber, 2020&lt;/a&gt;, have shown that these approaches are a generalization of backpropagation. They can even experimentally train these meta-learning networks to exactly replicate the backpropagation operation and perform stochastic gradient descent. But more importantly, these networks can evolve learning rules that learn more efficiently than stochastic gradient descent, or even Adam.&lt;/p&gt;
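As a minimal, hypothetical illustration of what "a generalization of backpropagation" means (this parameterization is mine, not the one used in either paper), consider an update rule of the form delta_w = a*grad + b*w + c. Plain SGD with learning rate lr is the special case (a, b, c) = (-lr, 0, 0), while other parameter settings express different update rules:

```python
import numpy as np

def learned_update(w, grad, theta):
    """A hypothetical parametric update rule containing SGD as a special case."""
    a, b, c = theta
    return w + a * grad + b * w + c

w = np.array([0.5, -1.0])
g = np.array([0.2, 0.4])

# theta = (-lr, 0, 0) recovers plain SGD exactly...
sgd_step = w - 0.1 * g
rule_step = learned_update(w, g, (-0.1, 0.0, 0.0))

# ...while e.g. a nonzero b adds weight decay, a different rule entirely.
decayed = learned_update(w, g, (-0.1, -0.01, 0.0))
```

The RNN-synapse rules above are far richer than this three-parameter family, but the containment argument is the same: the learned rule's hypothesis space includes SGD, so meta-training can only match or improve on it (on the meta-training distribution).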

&lt;p&gt;In the following experiment, &lt;a href=&quot;https://arxiv.org/abs/2012.14905&quot;&gt;Kirsch and Schmidhuber, 2020&lt;/a&gt; trained this type of meta-learning system, called a variable shared meta learner (the blue line), to learn a learning rule using only the MNIST dataset. The learned rule outperforms the backprop SGD and Adam baselines, which is expected, since it is fine-tuned to MNIST. But when they test the learning rule on a new dataset, like Fashion-MNIST, they see similar performance gains:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20221001/vsmetaml_result.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;These works are still in their early stages, but I think such approaches of modeling neural networks as a truly collective set of identical neurons or synapses, rather than as fixed unique weights, are a promising direction that could reshape the sub-field of meta-learning.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Neural network systems are highly complex. We may never be able to truly understand how they work at the level of simple idealized systems that can be explained (and predicted) with relatively simple physical laws. I believe that deep learning research can benefit from treating neural network systems, in their construction, training, and deployment, as complex systems. I hope this blog post is a useful survey of several ideas from complex systems that make neural network systems more robust and adaptive to changes in their environments.&lt;/p&gt;

&lt;p&gt;If you are interested in reading more, please check out our &lt;a href=&quot;https://journals.sagepub.com/doi/10.1177/26339137221114874&quot;&gt;paper&lt;/a&gt; published in &lt;a href=&quot;https://journals.sagepub.com/home/COL&quot;&gt;Collective Intelligence&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;citation&quot;&gt;Citation&lt;/h3&gt;

&lt;p&gt;If you find this blog post useful, please cite our paper as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{doi:10.1177/26339137221114874,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author = {David Ha and Yujin Tang},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title = {Collective intelligence for deep learning: A survey of recent developments},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal = {Collective Intelligence},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;volume = {1},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;number = {1},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year = {2022},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;doi = {10.1177/26339137221114874},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;URL = {https://doi.org/10.1177/26339137221114874},&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Oct 2022 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>EvoJAX: A Hardware-Accelerated Neuroevolution Toolkit</title>
        <link>https://blog.otoro.net/2022/02/10/evojax/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2022/02/10/evojax/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 50%;&quot;&gt;&lt;source src=&quot;/assets/20220210/evojax_waterworld.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;EvoJAX is a hardware-accelerated neuroevolution toolkit built on top of JAX. It can help run a wide range of evolution experiments within minutes on a TPU/GPU, compared to hours or days on CPU clusters.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://github.com/google/evojax/&quot;&gt;github.com/google/evojax/&lt;/a&gt;, where the repo resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://github.com/google/evojax/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Thu, 10 Feb 2022 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>Permutation-Invariant Neural Networks for Reinforcement Learning</title>
        <link>https://blog.otoro.net/2021/11/18/attentionneuron/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2021/11/18/attentionneuron/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/cover_orig.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;br /&gt;
&lt;i&gt;Reinforcement learning agents typically perform poorly if provided with inputs that were not clearly defined in training. A new approach enables RL agents to perform well, even when subject to corrupt, incomplete, or shuffled inputs.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Note: This blog post about our &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;paper&lt;/a&gt; is written by &lt;a href=&quot;https://twitter.com/yujin_tang&quot;&gt;Yujin Tang&lt;/a&gt; and myself, and was originally posted on &lt;a href=&quot;https://ai.googleblog.com/2021/11/permutation-invariant-neural-networks.html&quot;&gt;Google AI Blog&lt;/a&gt;. It has been cross-posted here for archival purposes.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p style=&quot;text-align: center;&quot;&gt;&lt;i&gt;“The brain is able to use information coming from the skin as if it were coming from the eyes. We don’t see with the eyes or hear with the ears, these are just the receptors, seeing and hearing in fact goes on in the brain.”&lt;/i&gt;&lt;/p&gt;

&lt;p style=&quot;text-align: right;&quot;&gt;— &lt;a href=&quot;https://en.wikipedia.org/wiki/Paul_Bach-y-Rita&quot;&gt;Paul Bach-y-Rita&lt;/a&gt; &lt;a href=&quot;https://en.wikipedia.org/wiki/Livewired_(book)&quot;&gt;¹&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People have the amazing ability to use one sensory modality (e.g., touch) to supply environmental information normally gathered by another sense (e.g., vision). This adaptive ability, called &lt;a href=&quot;https://en.wikipedia.org/wiki/Sensory_substitution&quot;&gt;sensory substitution&lt;/a&gt;, is a phenomenon well-known to neuroscience. While difficult adaptations — such as adjusting to seeing things &lt;a href=&quot;https://www.sciencedirect.com/science/article/abs/pii/S0010945217301314&quot;&gt;upside-down&lt;/a&gt;, learning to ride a &lt;a href=&quot;https://ed.ted.com/best_of_web/bf2mRAfC&quot;&gt;“backwards” bicycle&lt;/a&gt;, or learning to “see” by interpreting visual information emitted from a grid of electrodes placed on one’s tongue — require anywhere from weeks to months, or even years, to attain mastery, people are able to eventually adjust to sensory substitutions.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;img src=&quot;/assets/20211118/tongue.jpg&quot; width=&quot;100%&quot; /&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/reverse_bicycle.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Examples of Sensory Substitution. &lt;strong&gt;Left&lt;/strong&gt;: “Tongue Display Unit” (&lt;a href=&quot;https://www.sciencedirect.com/science/article/abs/pii/S0006899301026671&quot;&gt;Maris and Bach-y-Rita, 2001&lt;/a&gt;; Image: &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S1026309811001702#f000020&quot;&gt;Kaczmarek, 2011&lt;/a&gt;). &lt;strong&gt;Right&lt;/strong&gt;: The “backwards brain bicycle” (&lt;a href=&quot;https://ed.ted.com/best_of_web/bf2mRAfC&quot;&gt;TED Talk&lt;/a&gt;, &lt;a href=&quot;https://gifs.com/gif/the-backwards-brain-bicycle-smarter-every-day-133-yEEQE8&quot;&gt;Figure&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In contrast, most neural networks are not able to adapt to sensory substitutions at all. For instance, most &lt;a href=&quot;https://en.wikipedia.org/wiki/Reinforcement_learning&quot;&gt;reinforcement learning&lt;/a&gt; (RL) agents require their inputs to be in a pre-specified format, or else they will fail. They expect fixed-size inputs and assume that each element of the input carries a precise meaning, such as the pixel intensity at a specified location, or state information, like position or velocity. In popular RL benchmark tasks (e.g., &lt;a href=&quot;https://pybullet.org/wordpress/&quot;&gt;Ant&lt;/a&gt; or &lt;a href=&quot;https://github.com/google/brain-tokyo-workshop/tree/master/learntopredict/cartpole&quot;&gt;Cart-pole&lt;/a&gt;), an agent trained using current &lt;a href=&quot;https://github.com/DLR-RM/stable-baselines3&quot;&gt;RL algorithms&lt;/a&gt; will fail if its sensory inputs are changed or if the agent is fed additional noisy inputs that are unrelated to the task at hand.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning&lt;/a&gt;, a &lt;a href=&quot;https://arxiv.org/abs/2109.02869&quot;&gt;spotlight paper&lt;/a&gt; at &lt;a href=&quot;https://neurips.cc/&quot;&gt;NeurIPS 2021&lt;/a&gt;, we explore permutation invariant neural network agents, which require each of their sensory neurons (receptors that receive sensory inputs from the environment) to figure out the meaning and context of its input signal, rather than explicitly assuming a fixed meaning. Our experiments show that such agents are robust to observations that contain additional redundant or noisy information, and to observations that are corrupt and incomplete.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/ants.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/cartpole.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Permutation invariant reinforcement learning agents adapting to sensory substitutions. &lt;strong&gt;Left&lt;/strong&gt;: The ordering of the ant’s 28 observations are randomly shuffled every 200 time-steps. Unlike the standard policy, our policy is not affected by the suddenly permuted inputs. &lt;strong&gt;Right&lt;/strong&gt;: Cart-pole agent given many redundant noisy inputs (Try interactive &lt;a href=&quot;https://attentionneuron.github.io/#cartpole_demo_special&quot;&gt;web-demo&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In addition to adapting to sensory substitutions in state-observation environments (like the ant and cart-pole examples), we show that these agents can also adapt to sensory substitutions in complex visual-observation environments (such as a &lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing&lt;/a&gt; game that uses only pixel observations) and can still perform even when the stream of input images is constantly being reshuffled:&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_base.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_yosemite.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;We partition the visual input from CarRacing into a 2D grid of small patches, and shuffled their ordering (&lt;strong&gt;Left&lt;/strong&gt;). Without any additional training, our agent still performs even when the original training background is replaced with new images (&lt;strong&gt;Right&lt;/strong&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;method&quot;&gt;Method&lt;/h2&gt;

&lt;p&gt;Our approach takes observations from the environment at each time-step and feeds each element of the observation into distinct, but identical, neural networks (called “sensory neurons”), each with no fixed relationship to one another. Each sensory neuron integrates, over time, information from only its particular sensory input channel. Because each sensory neuron receives only a small part of the full picture, the neurons need to &lt;a href=&quot;https://en.wikipedia.org/wiki/Self-organization&quot;&gt;self-organize&lt;/a&gt; through communication in order for a global coherent behavior to &lt;a href=&quot;https://en.wikipedia.org/wiki/Emergence&quot;&gt;emerge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/20211118/schematic_input.png&quot; width=&quot;100%&quot; /&gt;
&lt;em&gt;&lt;strong&gt;Illustration of observation segmentation&lt;/strong&gt;. We segment each input into elements, which are then fed to independent sensory neurons. For non-vision tasks where the inputs are usually 1D vectors, each element is a scalar. For vision tasks, we crop each input image into non-overlapping patches.&lt;/em&gt;&lt;/p&gt;
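The segmentation step can be sketched as follows (the sizes here are assumptions for illustration: a 5-dimensional state vector, and a 96x96 RGB frame split into 6x6 patches):

```python
import numpy as np

# Non-vision task: each scalar of the state vector goes to its own neuron.
state = np.arange(5.0)                # e.g. a 5-dim cart-pole observation
elements = list(state)                # one scalar per sensory neuron

# Vision task: crop the frame into non-overlapping PxP patches.
img = np.zeros((96, 96, 3))           # assumed CarRacing-sized RGB frame
P = 6                                 # assumed patch size
patches = (img.reshape(96 // P, P, 96 // P, P, 3)
              .swapaxes(1, 2)         # group patch rows/cols together
              .reshape(-1, P, P, 3))  # flat list of (6, 6, 3) patches
```

Each element of `elements` (or each patch in `patches`) is then fed to one copy of the shared sensory-neuron network.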

&lt;p&gt;We encourage neurons to communicate with each other by training them to broadcast messages. While receiving information locally, each individual sensory neuron also continually broadcasts an output message at each time-step. These messages are consolidated and combined into an output vector, called the global latent code, using an attention mechanism similar to that applied in the &lt;a href=&quot;https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html&quot;&gt;Transformer&lt;/a&gt; architecture. A policy network then uses the global latent code to produce the action that the agent will use to interact with the environment. This action is also fed back into each sensory neuron in the next time-step, closing the communication loop.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/20211118/schematic_main.png&quot; width=&quot;100%&quot; /&gt;
&lt;em&gt;&lt;strong&gt;Overview of the permutation-invariant RL method&lt;/strong&gt;. We first feed each individual observation (o&lt;sub&gt;t&lt;/sub&gt;) into a particular sensory neuron (along with the agent’s previous action, a&lt;sub&gt;t-1&lt;/sub&gt;). Each neuron then produces and broadcasts a message independently, and an attention mechanism summarizes them into a global latent code (m&lt;sub&gt;t&lt;/sub&gt;) that is given to the agent’s downstream policy network (𝜋) to produce the agent’s action a&lt;sub&gt;t&lt;/sub&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Why is this system permutation invariant? Each sensory neuron is an identical neural network that is not confined to only process information from one particular sensory input. In fact, in our setup, the inputs to each sensory neuron are not defined. Instead, each neuron must figure out the meaning of its input signal by paying attention to the inputs received by the other sensory neurons, rather than explicitly assuming a fixed meaning. This encourages the agent to process the entire input as an &lt;a href=&quot;https://arxiv.org/abs/1810.00825&quot;&gt;unordered set&lt;/a&gt;, making the system permutation invariant to its input.&lt;/p&gt;

&lt;p&gt;The particular form of attention we used has been shown to work with &lt;a href=&quot;https://arxiv.org/abs/1810.00825&quot;&gt;unordered sets&lt;/a&gt;. Since our system treats the input as an unordered set, rather than an ordered list, the output will not be affected by the ordering of the sensory neurons (and by extension the ordering of the observations), thus attaining permutation invariance (our &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;paper&lt;/a&gt; includes an intuitive explanation of the permutation invariance of attention for interested readers looking to dive deeper). By processing the input as an unordered set, rather than a fixed-size list, the agent can use as many sensory neurons as required, thus enabling it to process observations of arbitrary length. Both of these properties will help the agent adapt to sensory substitutions.&lt;/p&gt;
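This invariance can be checked numerically. The sketch below (toy dimensions; it omits the per-neuron recurrence and action feedback described above, so it is not the paper's exact layer) projects each scalar observation with the same key and value maps, pools the resulting messages with a fixed set of learned queries, and verifies that shuffling the observations leaves the pooled latent code unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, d_msg, n_q = 8, 16, 4          # assumed toy sizes

W_k = rng.normal(size=(1, d_msg))     # shared key projection per element
W_v = rng.normal(size=(1, d_msg))     # shared value projection per element
Q = rng.normal(size=(n_q, d_msg))     # fixed learned queries (give the output its order)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_latent(obs):
    K = obs[:, None] @ W_k            # (n_obs, d_msg): one key per element
    V = obs[:, None] @ W_v            # (n_obs, d_msg): one value per element
    A = softmax(Q @ K.T / np.sqrt(d_msg), axis=-1)   # attend over the set
    return A @ V                      # (n_q, d_msg) global latent code

obs = rng.normal(size=n_obs)
perm = rng.permutation(n_obs)
# Shuffling permutes keys and values together, so the weighted sum is unchanged.
assert np.allclose(global_latent(obs), global_latent(obs[perm]))
```

The key point is that the output's order comes from the fixed queries, not from the input, so the summation over the set of messages erases the input ordering.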

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;We demonstrate the robustness and flexibility of this approach in simpler, state-observation environments, where the observations the agent receives as inputs are low-dimensional vectors holding information about the agent’s states, such as the position or velocity of its components. The agent in the popular &lt;a href=&quot;https://pybullet.org/wordpress/&quot;&gt;Ant&lt;/a&gt; locomotion task has a total of 28 inputs with information that includes positions and velocities. We shuffle the order of the input vector several times during a trial and show that the agent is rapidly able to adapt and is still able to walk forward.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://github.com/google/brain-tokyo-workshop/tree/master/learntopredict/cartpole&quot;&gt;cart-pole&lt;/a&gt;, the agent’s goal is to swing up a pole mounted at the center of the cart and balance it upright. Normally the agent sees only five inputs, but we modify the cart-pole environment to provide 15 shuffled input signals, 10 of which are pure noise, and the remainder of which are the actual observations from the environment. The agent is still able to perform the task, demonstrating the system’s capacity to work with a large number of inputs and attend only to channels it deems useful. Such flexibility may find useful applications for processing a large unspecified number of signals, most of which are noise, from ill-defined systems.&lt;/p&gt;

&lt;p&gt;We also apply this approach to high-dimensional vision-based environments where the observation is a stream of pixel images. Here, we investigate screen-shuffled versions of vision-based RL environments, where each observation frame is divided into a grid of patches, and like a puzzle, the agent must process the patches in a shuffled order to determine a course of action to take. To demonstrate our approach on vision-based tasks, we created a shuffled version of Atari Pong.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/pong_occluded.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/pong_base.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Shuffled Pong results&lt;/strong&gt;. &lt;strong&gt;Left&lt;/strong&gt;: Pong agent trained to play using only 30% of the patches matches performance of Atari opponent. &lt;strong&gt;Right&lt;/strong&gt;: Without extra training, when we give the agent more puzzle pieces, its performance increases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here the agent’s input is a variable-length list of patches, so unlike typical RL agents, the agent only gets to “see” a subset of patches from the screen. In the puzzle pong experiment, we pass to the agent a random sample of patches across the screen, which are then fixed through the remainder of the game. We find that we can discard 70% of the patches (at these fixed-random locations) and still train the agent to perform well against the built-in Atari opponent. Interestingly, if we then reveal additional information to the agent (e.g., allowing it access to more image patches), its performance increases, even without additional training. When the agent receives all the patches, in shuffled order, it wins 100% of the time, achieving the same result as agents that are trained while seeing the entire screen.&lt;/p&gt;

&lt;p&gt;We find that imposing additional difficulty during training by using unordered observations has additional benefits, such as improving generalization to unseen variations of the task, like when the background of the &lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing&lt;/a&gt; training environment is replaced with a novel image. To understand why the agent is capable of generalizing to new backgrounds, we visualize the patches of the (shuffled) screen to which the agent was paying attention. We find that the absence of fixed structure in the observations seems to encourage the agent to learn the essential structures in the environment (e.g., road edges) to best perform its task. We see that these attention attributes also transfer over to test environments, helping the agent generalize its policy to new backgrounds.&lt;/p&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_attention.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20211118/carracing_kyoto.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Shuffled CarRacing results. The agent has learned to focus its attention (indicated by the highlighted patches) on the road boundaries. &lt;strong&gt;Left&lt;/strong&gt;: Training environment. &lt;strong&gt;Right&lt;/strong&gt;: Test environment with new background.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The permutation invariant neural network agents presented here can handle ill-defined, varying observation spaces. Our agents are robust to observations that contain redundant or noisy information, or observations that are corrupt and incomplete. We believe that permutation invariant systems open up numerous possibilities in reinforcement learning.&lt;/p&gt;

&lt;p&gt;If you’re interested to learn more about this work, we invite readers to read our &lt;a href=&quot;https://attentionneuron.github.io/&quot;&gt;interactive article&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2109.02869&quot;&gt;pdf&lt;/a&gt; version) or watch our &lt;a href=&quot;https://youtu.be/7nTlXhx0CZI&quot;&gt;video&lt;/a&gt;. We also released &lt;a href=&quot;https://github.com/google/brain-tokyo-workshop&quot;&gt;code&lt;/a&gt; to reproduce our experiments.&lt;/p&gt;
</description>
        <pubDate>Thu, 18 Nov 2021 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>Modern Evolution Strategies for Creativity&amp;#x3a;&lt;br /&gt;Fitting Concrete Images and Abstract Concepts</title>
        <link>https://blog.otoro.net/2021/9/21/esclip/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2021/9/21/esclip/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 60%;&quot;&gt;&lt;source src=&quot;/assets/20210921/esclip.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;i&gt;“A drawing of a cat”&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;CLIP + ES + Triangles&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://es-clip.github.io/&quot;&gt;es-clip.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://es-clip.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Tue, 21 Sep 2021 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Neuroevolution of Self-Interpretable Agents</title>
        <link>https://blog.otoro.net/2020/3/18/attention/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2020/3/18/attention/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20200318/carracing_doom_stages.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Agents with a self-attention “bottleneck” not only can solve these tasks from pixel inputs with only 4000 parameters, but they are also better at generalization.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://attentionagent.github.io/&quot;&gt;attentionagent.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://attentionagent.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Wed, 18 Mar 2020 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Learning to Predict Without Looking Ahead</title>
        <link>https://blog.otoro.net/2019/10/29/learning/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2019/10/29/learning/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20191029/learncartpole5.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Rather than hardcoding forward prediction, we try to get agents to learn that they need to predict the future.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://learningtopredict.github.io/&quot;&gt;learningtopredict.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://learningtopredict.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Tue, 29 Oct 2019 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Weight Agnostic Neural Networks</title>
        <link>https://blog.otoro.net/2019/6/12/wann/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2019/6/12/wann/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20190612/wann_cover.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;We search for neural network architectures that can already perform various tasks even when they use random weight values.&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://weightagnostic.github.io/&quot;&gt;weightagnostic.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://weightagnostic.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Wed, 12 Jun 2019 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Learning Latent Dynamics for Planning from Pixels</title>
        <link>https://blog.otoro.net/2019/2/15/planet/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2019/2/15/planet/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;https://planetrl.github.io/assets/mp4/planet_intro.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;PlaNet learns a world model from image inputs only and successfully leverages it for planning in latent space.&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://planetrl.github.io/&quot;&gt;planetrl.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://planetrl.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Fri, 15 Feb 2019 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>Reinforcement Learning for Improving Agent Design</title>
        <link>https://blog.otoro.net/2018/10/10/design-rl/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2018/10/10/design-rl/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;https://storage.googleapis.com/quickdraw-models/sketchRNN/designrl/augmentbipedsmalllegs.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Little dude rewarded for having little legs.&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://designrl.github.io/&quot;&gt;designrl.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://designrl.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Wed, 10 Oct 2018 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>World Models Experiments</title>
        <link>https://blog.otoro.net/2018/06/09/world-models-experiments/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2018/06/09/world-models-experiments/</guid>
        <description>&lt;center&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20180609/worldmodels_experiments_small.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;br /&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/hardmaru/WorldModelsExperiments&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this article I will give step-by-step instructions for reproducing the experiments in the &lt;a href=&quot;https://worldmodels.github.io&quot;&gt;World Models&lt;/a&gt; article (&lt;a href=&quot;https://arxiv.org/abs/1803.10122&quot;&gt;pdf&lt;/a&gt;). The reference TensorFlow implementation is on &lt;a href=&quot;https://github.com/hardmaru/WorldModelsExperiments&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Other people have implemented World Models independently. There is an implementation in &lt;a href=&quot;https://medium.com/applied-data-science/how-to-build-your-own-world-model-using-python-and-keras-64fb388ba459&quot;&gt;Keras&lt;/a&gt; that reproduces part of the CarRacing-v0 experiment. There is also another project in &lt;a href=&quot;https://dylandjian.github.io/world-models/&quot;&gt;PyTorch&lt;/a&gt; that attempts to apply this model on &lt;a href=&quot;https://blog.openai.com/retro-contest/&quot;&gt;OpenAI Retro Sonic&lt;/a&gt; environments.&lt;/p&gt;

&lt;p&gt;For general discussion about the World Models article, there are already some good discussion threads on the GitHub &lt;a href=&quot;https://github.com/worldmodels/worldmodels.github.io/issues&quot;&gt;issues&lt;/a&gt; page of the interactive article. If you have any issues specific to the code, please don’t hesitate to raise an &lt;a href=&quot;https://github.com/hardmaru/WorldModelsExperiments/issues&quot;&gt;issue&lt;/a&gt; to discuss.&lt;/p&gt;

&lt;h1 id=&quot;pre-requisite-reading&quot;&gt;Pre-requisite reading&lt;/h1&gt;

&lt;p&gt;I recommend reading the following articles to gain some background knowledge before attempting to reproduce the experiments.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://worldmodels.github.io/&quot;&gt;World Models&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/1803.10122&quot;&gt;pdf&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2017/10/29/visual-evolution-strategies/&quot;&gt;A Visual Guide to Evolution Strategies&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2017/11/12/evolving-stable-strategies/&quot;&gt;Evolving Stable Strategies&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;below-is-optional&quot;&gt;&lt;em&gt;Below is optional&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2015/06/14/mixture-density-networks/&quot;&gt;Mixture Density Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2015/11/24/mixture-density-networks-with-tensorflow/&quot;&gt;Mixture Density Networks with TensorFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read tutorials on Variational Autoencoders if you are not familiar with them. Some Examples:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jmetzen.github.io/2015-11-27/vae.html&quot;&gt;Variational Autoencoder in TensorFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.keras.io/building-autoencoders-in-keras.html&quot;&gt;Building Autoencoders in Keras&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2016/04/01/generating-large-images-from-latent-vectors/&quot;&gt;Generating Large Images from Latent Vectors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Be familiar with RNNs for continuous sequence generation:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1308.0850&quot;&gt;Generating Sequences With Recurrent Neural Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2015/12/12/handwriting-generation-demo-in-tensorflow/&quot;&gt;Handwriting Generation Demo in TensorFlow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.otoro.net/2017/01/01/recurrent-neural-network-artist/&quot;&gt;Recurrent Neural Network Tutorial for Artists&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;software-settings&quot;&gt;Software Settings&lt;/h1&gt;

&lt;p&gt;I have tested the code with the following settings:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Ubuntu 16.04&lt;/li&gt;
  &lt;li&gt;Python 3.5.4&lt;/li&gt;
  &lt;li&gt;TensorFlow 1.8.0&lt;/li&gt;
  &lt;li&gt;NumPy 1.13.3&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/ppaquette/gym-doom&quot;&gt;VizDoom Gym Levels&lt;/a&gt; &lt;code&gt;(Latest commit 60ff576 on Mar 18, 2017)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;OpenAI Gym 0.9.4 (&lt;strong&gt;Note: Gym 1.0+ breaks this experiment. Only tested for 0.9.x&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;cma 2.2.0&lt;/li&gt;
  &lt;li&gt;mpi4py 2, see &lt;a href=&quot;https://github.com/hardmaru/estool&quot;&gt;estool&lt;/a&gt;, which we have forked for this project.&lt;/li&gt;
  &lt;li&gt;Jupyter Notebook for model testing, and tracking progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used OS X for inference, but trained the models using Google Cloud VMs. I trained the V and M models on a P100 GPU instance, and trained the controller C on a pure CPU instance with 64 CPU cores (&lt;a href=&quot;https://cloud.google.com/compute/pricing&quot;&gt;n1-standard-64&lt;/a&gt;) using CMA-ES. I will outline which parts of the training require GPUs and which parts use only CPUs, and try to keep your costs low for running this experiment.&lt;/p&gt;

&lt;h1 id=&quot;instructions-for-running-pre-trained-models&quot;&gt;Instructions for running pre-trained models&lt;/h1&gt;

&lt;p&gt;To reproduce the results with the pre-trained models provided in the repo, you only need to clone it onto a desktop computer running in CPU mode. No Cloud VM or GPUs necessary.&lt;/p&gt;

&lt;h2 id=&quot;carracing-v0&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;If you are using a MacBook Pro, I recommend setting the resolution to “More Space”, since the CarRacing-v0 environment renders at a larger resolution and doesn’t fit in the default screen settings.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/macbook_resolution.jpeg&quot; width=&quot;75%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In the command line, go into the &lt;code class=&quot;highlighter-rouge&quot;&gt;carracing&lt;/code&gt; subdirectory. To play the game yourself, run &lt;code class=&quot;highlighter-rouge&quot;&gt;python env.py&lt;/code&gt; in a terminal. You can control the car using the four arrow keys: up/down to accelerate/brake, and left/right to steer.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/carracing_human_play.png&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this environment, a new random track is generated for each run. While I can consistently get above 800 if I drive very carefully, it is hard for me to consistently score above 900 points. Some Stanford &lt;a href=&quot;https://twitter.com/hardmaru/status/934872621077839872&quot;&gt;students&lt;/a&gt; also found it tough to get consistently higher than 900. To solve this environment, an agent must obtain an average score of 900 over 100 consecutive random trials.&lt;/p&gt;

&lt;p&gt;To run the pre-trained model once and see the agent in full-rendered mode, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py render log/carracing.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Run the pre-trained model 100 times in &lt;code class=&quot;highlighter-rouge&quot;&gt;no-render&lt;/code&gt; mode (even in &lt;code class=&quot;highlighter-rouge&quot;&gt;no-render&lt;/code&gt; mode, a simpler display is still rendered on screen, since this environment needs OpenGL to extract the pixel observations):&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py norender log/carracing.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command will output the score for each of the 100 trials, and after all 100 runs it will also output the average score and standard deviation. The average score should be above 900.&lt;/p&gt;
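&lt;p&gt;For reference, the summary statistics reported at the end can be reproduced from the per-trial scores with a few lines of Python. The scores below are made-up placeholders, not actual results:&lt;/p&gt;

```python
# Summarize per-trial scores the way the evaluation run does: mean and
# standard deviation over all trials. The scores here are placeholders;
# a real run would produce 100 of them.
import statistics

scores = [905.3, 887.1, 912.8, 899.4]  # in practice, 100 trial scores

avg = statistics.mean(scores)
std = statistics.pstdev(scores)  # population std-dev over the trials
print(f"avg score: {avg:.2f}, std: {std:.2f}")
```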

&lt;p&gt;To run the pre-trained controller inside of an environment generated using M and visualized using V:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python dream_model.py log/carracing.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/carracing_dream.png&quot; width=&quot;50%&quot; /&gt;
&lt;/center&gt;

&lt;h2 id=&quot;doomtakecover-v0&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/DoomTakeCover-v0/&quot;&gt;DoomTakeCover-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In the &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; directory, run &lt;code class=&quot;highlighter-rouge&quot;&gt;python doomrnn.py&lt;/code&gt; to play inside of an environment generated by M.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn_dream_env.png&quot; width=&quot;50%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;You can hit left, down, or right to play inside of this environment. To visualize the pre-trained model playing inside of the real environment, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomreal render log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn_actual.png&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;Note that this environment is modified to also display the cropped 64x64px frames, in addition to the reconstructed frames and actual frames of the game. To run the model inside the actual environment 100 times and compute the mean score, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomreal norender log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should get a mean score of over 900 time-steps over 100 random episodes. The above two commands also work with &lt;code class=&quot;highlighter-rouge&quot;&gt;doomreal&lt;/code&gt; replaced by &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt;, if you want the statistics of the agent playing inside of the generated environment. If you wish to change the temperature of the generated environment, modify the constant &lt;code class=&quot;highlighter-rouge&quot;&gt;TEMPERATURE&lt;/code&gt; inside &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn.py&lt;/code&gt;, which is currently set to 1.25.&lt;/p&gt;
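&lt;p&gt;To give a feel for what the temperature does, here is a minimal, self-contained sketch of temperature-controlled sampling from a mixture of Gaussians, the mechanism an MDN-RNN’s sampling step uses. The function and its arguments are illustrative, not the actual code in &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn.py&lt;/code&gt;:&lt;/p&gt;

```python
# Illustrative temperature-controlled sampling from a Gaussian mixture
# (the mechanism the TEMPERATURE constant tunes); not the repo's code.
import math
import random

def sample_with_temperature(logits, mu, sigma, temperature=1.25, rng=random):
    # Higher temperature flattens the mixture weights and widens each
    # Gaussian, producing a more "uncertain" generated environment.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(l - m) for l in scaled]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(mu[k], sigma[k] * math.sqrt(temperature))
```

&lt;p&gt;At very low temperature, samples collapse toward the mean of the most likely mixture component; at high temperature, rarely-used components get picked more often and each sample is noisier.&lt;/p&gt;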

&lt;p&gt;To visualize the model playing inside of the generated environment, run:&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomrnn render log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn_dream_agent.png&quot; width=&quot;50%&quot; /&gt;
&lt;/center&gt;

&lt;h1 id=&quot;instructions-for-training-everything-from-scratch&quot;&gt;Instructions for training everything from scratch&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;DoomTakeCover-v0&lt;/code&gt; experiment should take less than 24 hours to completely reproduce from scratch using a P100 instance and a 64-core CPU instance on &lt;a href=&quot;https://cloud.google.com/&quot;&gt;Google Cloud Platform&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;doomtakecover-v0-1&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/DoomTakeCover-v0/&quot;&gt;DoomTakeCover-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;I will discuss the VizDoom experiment first since it requires less compute time to reproduce from scratch. Since you may update the models in the repo, I recommend forking the repo and cloning/updating from your fork. I recommend running any command inside a &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; session so that you can close your ssh connection and the jobs will keep running in the background.&lt;/p&gt;

&lt;p&gt;I first create a 64-core CPU instance with ~ 200GB storage and 220GB RAM, and clone the repo in that instance. In the &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; directory, there is a script called &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.py&lt;/code&gt; that will extract 200 episodes from a random policy, and save the episodes as &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files in &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn/record&lt;/code&gt;. A bash script called &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.bash&lt;/code&gt; will run &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.py&lt;/code&gt; 64 times (~ one job per CPU core), so by running &lt;code class=&quot;highlighter-rouge&quot;&gt;bash extract.bash&lt;/code&gt;, we will generate 12,800 &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files in &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn/record&lt;/code&gt;. Some instances might randomly fail, so we generate a bit of extra data, although in the end we only use 10,000 episodes for training V and M. This process will take a few hours (probably less than 5 hours).&lt;/p&gt;
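&lt;p&gt;The fan-out arithmetic is simple: 64 jobs × 200 episodes = 12,800 episodes. A rough Python analogue of what &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.bash&lt;/code&gt; does follows; the bash script actually launches 64 OS processes, and the job body below is a placeholder, so this only illustrates the parallel fan-out:&lt;/p&gt;

```python
# Rough analogue of extract.bash: launch 64 extraction jobs in parallel,
# each recording 200 random-policy episodes. extract.bash uses separate
# OS processes; a thread pool is used here only to illustrate the fan-out.
from concurrent.futures import ThreadPoolExecutor

NUM_JOBS = 64           # ~ one job per CPU core
EPISODES_PER_JOB = 200  # each run of extract.py records 200 episodes

def extract_job(job_id):
    # Placeholder for `python extract.py`, which would save .npz episode
    # files into doomrnn/record.
    return EPISODES_PER_JOB

with ThreadPoolExecutor(max_workers=NUM_JOBS) as pool:
    totals = list(pool.map(extract_job, range(NUM_JOBS)))

print(sum(totals))  # 12800 episodes in total
```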

&lt;p&gt;After the &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files have been created in the &lt;code class=&quot;highlighter-rouge&quot;&gt;record&lt;/code&gt; subdirectory, I create a P100 GPU instance with ~ 200GB storage and 220GB RAM, and clone the repo there too. I use the ssh copy command, &lt;code class=&quot;highlighter-rouge&quot;&gt;scp&lt;/code&gt;, to copy all of the &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files from the CPU instance to the GPU instance, into the same &lt;code class=&quot;highlighter-rouge&quot;&gt;record&lt;/code&gt; subdirectory. You can use the &lt;code class=&quot;highlighter-rouge&quot;&gt;gcloud&lt;/code&gt; tool if &lt;code class=&quot;highlighter-rouge&quot;&gt;scp&lt;/code&gt; doesn’t work. This should be really fast, like less than a minute, if both instances are in the same region. Shut down the CPU instance after you have copied the &lt;code class=&quot;highlighter-rouge&quot;&gt;.npz&lt;/code&gt; files over to the GPU machine.&lt;/p&gt;

&lt;p&gt;On the GPU machine, run the command &lt;code class=&quot;highlighter-rouge&quot;&gt;bash gpu_jobs.bash&lt;/code&gt; to train the VAE, pre-process the recorded dataset, and train the MDN-RNN.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;gpu_jobs.bash&lt;/code&gt; script will run three jobs in sequential order:&lt;/p&gt;

&lt;p&gt;1) &lt;code class=&quot;highlighter-rouge&quot;&gt;python vae_train.py&lt;/code&gt; trains the VAE; after training, the model is saved in &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_vae/vae.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;2) Next, it pre-processes the collected data using the pre-trained VAE by launching &lt;code class=&quot;highlighter-rouge&quot;&gt;python series.py&lt;/code&gt;. A new dataset will be created in a subdirectory called &lt;code class=&quot;highlighter-rouge&quot;&gt;series&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;3) Once the &lt;code class=&quot;highlighter-rouge&quot;&gt;series.npz&lt;/code&gt; dataset is saved there, the script launches the MDN-RNN trainer with the command &lt;code class=&quot;highlighter-rouge&quot;&gt;python rnn_train.py&lt;/code&gt;. This produces a model in &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_rnn/rnn.json&lt;/code&gt; and also &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_initial_z/initial_z.json&lt;/code&gt;. The file &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z.json&lt;/code&gt; stores the initial latent variables (z) of an episode, which are needed when we generate the environment. This entire process might take 6-8 hours.&lt;/p&gt;
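&lt;p&gt;Conceptually, &lt;code class=&quot;highlighter-rouge&quot;&gt;gpu_jobs.bash&lt;/code&gt; is just a fail-fast sequential runner: each stage starts only if the previous one succeeded. The helper below is an illustrative Python sketch of that behavior, not code from the repo:&lt;/p&gt;

```python
# Fail-fast sequential runner, mirroring what gpu_jobs.bash does: each
# stage starts only if the previous one succeeded. Illustrative helper,
# not code from the repo.
import subprocess

def run_pipeline(commands):
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure

# gpu_jobs.bash effectively performs:
# run_pipeline([
#     ["python", "vae_train.py"],  # 1) train VAE -> tf_vae/vae.json
#     ["python", "series.py"],     # 2) pre-process data -> series/
#     ["python", "rnn_train.py"],  # 3) train MDN-RNN -> tf_rnn/rnn.json
# ])
```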

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doom_vae_test.png&quot; width=&quot;50%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;The notebook &lt;code&gt;vae_test.ipynb&lt;/code&gt; will visualize input/reconstruction images using your VAE on the training dataset.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;After V and M are trained, and you have the 3 new &lt;code class=&quot;highlighter-rouge&quot;&gt;json&lt;/code&gt; files, you must now copy &lt;code class=&quot;highlighter-rouge&quot;&gt;vae.json&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z.json&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn.json&lt;/code&gt; over to the &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_models&lt;/code&gt; subdirectory and overwrite any previous files there. You should update your git repo with these new models using &lt;code class=&quot;highlighter-rouge&quot;&gt;git add doomrnn/tf_models/*.json&lt;/code&gt; and commit the change to your fork. After you have done this, you can shut down the GPU machine. Then start the 64-core CPU instance again and log back into that machine.&lt;/p&gt;

&lt;p&gt;Now on the 64-core CPU instance, run the CMA-ES based training by launching the command &lt;code class=&quot;highlighter-rouge&quot;&gt;python train.py&lt;/code&gt; inside the &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; directory. This will launch the evolution trainer, which continues training until you &lt;code class=&quot;highlighter-rouge&quot;&gt;Ctrl-C&lt;/code&gt; the job. The controller C will be trained inside of M’s generated environment with a temperature of 1.25. You can monitor progress using the &lt;code class=&quot;highlighter-rouge&quot;&gt;plot_training_progress.ipynb&lt;/code&gt; notebook, which loads the &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt; files being generated. 200 generations (around 4-5 hours) should be enough to get decent results, and you can stop the job then. I left my job running for close to 1800 generations, although it doesn’t really add much value after 200 generations, so I prefer not to waste your money. Add all of the files inside &lt;code class=&quot;highlighter-rouge&quot;&gt;log/*.json&lt;/code&gt; into your forked repo and then shut down the instance.&lt;/p&gt;
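&lt;p&gt;If evolution strategies are unfamiliar, the articles linked in the pre-requisite reading cover them properly; the toy hill-climber below only conveys the general flavor of population-based search (sample perturbations, evaluate, move toward the best). It is deliberately much simpler than CMA-ES, which also adapts its sampling covariance:&lt;/p&gt;

```python
# Toy population-based search, in the same spirit as (but much simpler
# than) the CMA-ES trainer: sample perturbations around the current
# parameters, evaluate each, and step toward the best one.
import random

def toy_es(fitness, dim, iterations=300, pop=32, sigma=0.5, lr=0.3, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(iterations):
        best_eps, best_fit = None, float("-inf")
        for _ in range(pop):
            eps = [rng.gauss(0, sigma) for _ in range(dim)]
            cand = [t + e for t, e in zip(theta, eps)]
            f = fitness(cand)
            if f > best_fit:
                best_fit, best_eps = f, eps
        theta = [t + lr * e for t, e in zip(theta, best_eps)]
    return theta

# e.g. maximizing -(x - 3)^2 drives theta toward 3.0
```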

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/doomrnn.cma.16.64.wall.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/20180609/doomrnn.cma.16.64.svg&quot; width=&quot;100%&quot; /&gt;
&lt;i&gt;Training DoomRNN using CMA-ES. Recording C's performance inside of the generated environment.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Using your desktop instance, and pulling your forked repo again, you can now run the following to test your newly trained V, M, and C models.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;python model.py doomreal render log/doomrnn.cma.16.64.best.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can replace &lt;code class=&quot;highlighter-rouge&quot;&gt;doomreal&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;doomrnn&lt;/code&gt; to try the generated environment, or &lt;code class=&quot;highlighter-rouge&quot;&gt;render&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;norender&lt;/code&gt; to run your agent 100 times.&lt;/p&gt;

&lt;h2 id=&quot;carracing-v0-1&quot;&gt;&lt;a href=&quot;https://gym.openai.com/envs/CarRacing-v0/&quot;&gt;CarRacing-v0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The process for CarRacing-v0 is almost the same as the VizDoom example earlier, so I will discuss the differences in this section.&lt;/p&gt;

&lt;p&gt;Since this environment is built using OpenGL, it relies on a graphics output even in the gym environment’s &lt;code class=&quot;highlighter-rouge&quot;&gt;no-render&lt;/code&gt; mode, so on a Cloud VM box, I had to wrap the command with a headless X server. You can see that inside the &lt;code class=&quot;highlighter-rouge&quot;&gt;extract.bash&lt;/code&gt; file in the &lt;code class=&quot;highlighter-rouge&quot;&gt;carracing&lt;/code&gt; directory, I run &lt;code class=&quot;highlighter-rouge&quot;&gt;xvfb-run -a -s &quot;-screen 0 1400x900x24 +extension RANDR&quot;&lt;/code&gt; before the real command. Other than this, the procedure for collecting data and training the V and M models is the same as VizDoom.&lt;/p&gt;

&lt;p&gt;Please note that after you train your VAE and MDN-RNN models, you must copy &lt;code class=&quot;highlighter-rouge&quot;&gt;vae.json&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z.json&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn.json&lt;/code&gt; over to the &lt;code class=&quot;highlighter-rouge&quot;&gt;vae&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;initial_z&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn&lt;/code&gt; directories respectively (not &lt;code class=&quot;highlighter-rouge&quot;&gt;tf_models&lt;/code&gt; as in DoomRNN), overwriting any previous files, and then update the forked repo as usual.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/car_vae_test.png&quot; width=&quot;50%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;&lt;code&gt;vae_test.ipynb&lt;/code&gt; used to examine the VAE trained on &lt;code&gt;CarRacing-v0&lt;/code&gt;'s extracted data.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;In this environment, we use the V and M models as model predictive control (MPC) and train the controller C on the actual environment, rather than inside of the generated environment. So rather than running &lt;code class=&quot;highlighter-rouge&quot;&gt;python train.py&lt;/code&gt;, you need to run &lt;code class=&quot;highlighter-rouge&quot;&gt;gce_train.bash&lt;/code&gt;, which uses headless X sessions to run the CMA-ES trainer. Because we train in the actual environment, training is slower compared to DoomRNN. By running the training inside a &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; session, you can monitor progress using the &lt;code class=&quot;highlighter-rouge&quot;&gt;plot_training_progress.ipynb&lt;/code&gt; notebook, which loads the &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt; files being generated, by running Jupyter in another &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; session in parallel.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20180609/carracing.cma.16.64.wall.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/20180609/carracing.cma.16.64.svg&quot; width=&quot;100%&quot; /&gt;
&lt;i&gt;Training CarRacing-v0 using CMA-ES. Recording C's performance inside of the actual environment.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;150-200 generations (around 3 days) should be enough to reach a mean score of ~ 880, which is pretty close to the required score of 900. If you don’t have a lot of money or credits to burn, I recommend stopping once you are satisfied with a score of 850+ (which takes around a day of training). Qualitatively, a score of ~ 850-870 is not that much worse compared to our final agent that achieves 900+, and I don’t want to burn your hard-earned money on cloud credits. To get 900+ it might take weeks (who said getting SOTA was easy? :). The final models are saved in &lt;code class=&quot;highlighter-rouge&quot;&gt;log/*.json&lt;/code&gt; and you can test and view them the usual way.&lt;/p&gt;

&lt;h1 id=&quot;contributing&quot;&gt;Contributing&lt;/h1&gt;

&lt;p&gt;There are many cool ideas to try out: for instance, iterative training methods, transfer learning, intrinsic motivation, and other environments.&lt;/p&gt;

&lt;center&gt;
&lt;video autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 80%;&quot;&gt;&lt;source src=&quot;/assets/20180609/generative_pixel_pendulum.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
&lt;i&gt;A generative noisy pixel pendulum environment?&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;If you want to extend the code and try out new things, I recommend modifying the code to solve a specific new environment, rather than trying to improve the code to work for multiple environments at the same time. I find that for research work, and when trying to solve difficult environments, specific custom modifications are usually required. You are welcome to submit a pull request with a self-contained subdirectory tailored to a specific challenging environment you have attempted to solve, with instructions in a &lt;code class=&quot;highlighter-rouge&quot;&gt;README.md&lt;/code&gt; file in your subdirectory.&lt;/p&gt;

&lt;h1 id=&quot;citation&quot;&gt;Citation&lt;/h1&gt;

&lt;p&gt;If you found this code useful in an academic setting, please cite:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@incollection{ha2018worldmodels,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title = {Recurrent World Models Facilitate Policy Evolution},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author = {Ha, David and Schmidhuber, J{\&quot;u}rgen},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;booktitle = {Advances in Neural Information Processing Systems 31},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;pages = {2451--2463},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year = {2018},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;publisher = {Curran Associates, Inc.},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;url = {https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;note = &quot;\url{https://worldmodels.github.io}&quot;,&lt;br /&gt;
}&lt;br /&gt;
&lt;/code&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 09 Jun 2018 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>World Models</title>
        <link>https://blog.otoro.net/2018/03/27/world-models/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2018/03/27/world-models/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20180327/world_card_small.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Can agents learn inside of their own dreams?&lt;/i&gt;&lt;br /&gt;
&lt;!--&lt;code&gt;
&lt;a href=&quot;https://github.com/worldmodels/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Redirecting to &lt;a href=&quot;https://worldmodels.github.io/&quot;&gt;worldmodels.github.io&lt;/a&gt;, where the article resides.&lt;/p&gt;

&lt;script&gt;
console.log('redirect.');
window.location.href = &quot;https://worldmodels.github.io/&quot;;
&lt;/script&gt;

</description>
        <pubDate>Tue, 27 Mar 2018 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Evolving Stable Strategies</title>
        <link>https://blog.otoro.net/2017/11/12/evolving-stable-strategies/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/11/12/evolving-stable-strategies/</guid>
        <description>&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;video class=&quot;b-lazy&quot; autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_normal.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;&lt;br /&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;i&gt;Evolved Biped Walker.&lt;/i&gt;&lt;br/&gt;--&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Going for a ride.&lt;/i&gt;&lt;br /&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;previous article&lt;/a&gt;, I described a few evolution strategies (ES) algorithms that can optimise the parameters of a function without needing to explicitly calculate gradients. These algorithms can be applied to reinforcement learning (RL) problems to help find a suitable set of model parameters for a neural network agent. In this article, I will explore applying ES to some of these RL problems, and also highlight methods we can use to find policies that are more stable and robust.&lt;/p&gt;

&lt;h2 id=&quot;evolution-strategies-for-reinforcement-learning&quot;&gt;Evolution Strategies for Reinforcement Learning&lt;/h2&gt;

&lt;p&gt;While RL algorithms require a reward signal to be given to the agent at every timestep, ES algorithms only care about the final cumulative reward that an agent gets at the end of its rollout in an environment. In many problems we only know the outcome at the end of the task, such as whether the agent wins or loses, whether the robot arm picks up the object or not, or whether the agent has survived, and these are the problems where ES may have an advantage over traditional RL. Below is pseudocode that encapsulates a rollout of an agent in an &lt;a href=&quot;https://gym.openai.com/docs/&quot;&gt;OpenAI Gym&lt;/a&gt; environment, where we only care about the cumulative reward:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;rollout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;obs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;done&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;total_reward&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_action&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;obs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;obs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;info&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;total_reward&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total_reward&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We can define &lt;code class=&quot;highlighter-rouge&quot;&gt;rollout&lt;/code&gt; to be the objective function that maps the model parameters of an agent into its fitness score, and use an ES solver to find a suitable set of model parameters as described in the previous &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;article&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gym&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'worlddomination-v0'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# use our favourite ES&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EvolutionStrategy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# ask the ES to give set of params&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;solutions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# create array to hold the results&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;fitlist&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;popsize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# evaluate for each given solution&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;popsize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# init the agent with a solution&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;solutions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

    &lt;span class=&quot;c&quot;&gt;# rollout env with this agent&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fitlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rollout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# give scores results back to ES&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tell&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fitlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# get best param &amp;amp; fitness from ES&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;bestsol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bestfit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;solver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# see if our task is solved&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bestfit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MY_REQUIREMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;deterministic-and-stochastic-policies&quot;&gt;Deterministic and Stochastic Policies&lt;/h2&gt;

&lt;p&gt;Our agent takes the observation given to it by the environment as an input, and outputs an action at each timestep during a rollout inside the environment. We can model the agent however we want, using methods ranging from hard-coded rules and decision trees to linear functions and recurrent neural networks. In this article I use a simple feed-forward network with 2 hidden layers to map from an agent’s observation, a vector &lt;script type=&quot;math/tex&quot;&gt;x&lt;/script&gt;, directly to the actions, a vector &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;h_1 = f_h(W_1 \; x + b_1)&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;h_2 = f_h(W_2 \; h_1 + b_2)&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;y = f_{out}(W_{out} \; h_2 + b_{out})&lt;/script&gt;

&lt;p&gt;The activation functions &lt;script type=&quot;math/tex&quot;&gt;f_h&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;f_{out}&lt;/script&gt; can be &lt;code class=&quot;highlighter-rouge&quot;&gt;tanh&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;sigmoid&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt;, or whatever we want to use. In all of my experiments I use &lt;code class=&quot;highlighter-rouge&quot;&gt;tanh&lt;/code&gt;. For the output layer, sometimes we may want &lt;script type=&quot;math/tex&quot;&gt;f_{out}&lt;/script&gt; to be a pass-through function without nonlinearities. If we concatenate all the weight and bias parameters into a single vector called &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt;, we see that the above neural network is a deterministic function &lt;script type=&quot;math/tex&quot;&gt;y = F(x, W)&lt;/script&gt;. We can then use ES to find a solution &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; using the search loop described earlier.&lt;/p&gt;
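&lt;p&gt;As a concrete sketch (not the exact code used in my experiments, and with placeholder layer sizes), the two-hidden-layer policy above takes only a few lines of numpy:&lt;/p&gt;

```python
import numpy as np

def make_params(sizes, seed=0):
    # sizes, e.g. [24, 64, 32, 4]: observation dim, two hidden layers, action dim
    rng = np.random.RandomState(seed)
    return [(rng.randn(n_out, n_in) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def policy(x, params):
    # h1 = tanh(W1 x + b1), h2 = tanh(W2 h1 + b2), y = tanh(W_out h2 + b_out)
    h = np.asarray(x, dtype=float)
    for W, b in params:
        h = np.tanh(W @ h + b)
    return h

params = make_params([24, 64, 32, 4])  # flattenable into the single vector W that ES searches over
y = policy(np.zeros(24), params)       # 4 action values, each squashed into (-1, 1)
```

&lt;p&gt;Concatenating every weight matrix and bias in &lt;code class=&quot;highlighter-rouge&quot;&gt;params&lt;/code&gt; into one flat vector gives the &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; that the ES solver proposes candidates for.&lt;/p&gt;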

&lt;p&gt;But what if we don’t want our agent’s policy to be deterministic? For certain tasks, even as simple as rock-paper-scissors, the optimal policy is a random action, so we want our agent to be able to learn a stochastic policy. One way to convert &lt;script type=&quot;math/tex&quot;&gt;y=F(x, W)&lt;/script&gt; into a stochastic policy is to make &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; random. Each model parameter &lt;script type=&quot;math/tex&quot;&gt;w_i \in W&lt;/script&gt; can be a random value drawn from a normal distribution &lt;script type=&quot;math/tex&quot;&gt;N(\mu_i, \sigma_i)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;This type of stochastic network is called a &lt;em&gt;Bayesian Neural Network&lt;/em&gt;. A &lt;a href=&quot;http://edwardlib.org/tutorials/bayesian-neural-network&quot;&gt;Bayesian neural network&lt;/a&gt; is a neural network with a prior distribution on its weights. In this case, the model parameters we want to solve for are the set of &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; vectors, rather than the weights &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt;. During each forward pass of the network, a new &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; is drawn from &lt;script type=&quot;math/tex&quot;&gt;N(\mu, \sigma I)&lt;/script&gt;. There are many &lt;a href=&quot;https://arxiv.org/abs/1703.02910&quot;&gt;interesting&lt;/a&gt; &lt;a href=&quot;https://github.com/andrewgordonwilson/bayesgan/blob/master/README.md&quot;&gt;works&lt;/a&gt; in the literature applying Bayesian networks to many problems, and also &lt;a href=&quot;http://bayesiandeeplearning.org/&quot;&gt;addressing&lt;/a&gt; many challenges of &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.704.7138&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;training&lt;/a&gt; these networks. ES can also be used to directly find solutions for a stochastic policy by setting the solution space to be the &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; vectors, rather than &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt;.&lt;/p&gt;
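&lt;p&gt;A minimal sketch of this idea (the names and sizes below are illustrative, not from the actual experiments): the ES solver proposes a &lt;code class=&quot;highlighter-rouge&quot;&gt;(mu, log_sigma)&lt;/code&gt; pair, and each forward pass draws a fresh weight vector from the corresponding normal distribution:&lt;/p&gt;

```python
import numpy as np

def sample_weights(mu, log_sigma, rng):
    # draw one concrete weight vector W ~ N(mu, diag(sigma^2));
    # ES searches over (mu, log_sigma) instead of over W directly
    sigma = np.exp(log_sigma)
    return mu + sigma * rng.randn(len(mu))

rng = np.random.RandomState(0)
n_params = 10
mu = np.zeros(n_params)              # means of the weight distribution
log_sigma = np.full(n_params, -2.0)  # so sigma = exp(-2), about 0.135
W = sample_weights(mu, log_sigma, rng)  # redrawn on every forward pass
```

&lt;p&gt;Parameterising the standard deviations in log space keeps them positive without any constraint handling, which is convenient when ES perturbs them freely.&lt;/p&gt;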

&lt;p&gt;Stochastic policy networks are also popular in the RL literature. For example, in the &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;Proximal Policy Optimization (PPO)&lt;/a&gt; algorithm, the final layer is a set of &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; parameters and the action is sampled from &lt;script type=&quot;math/tex&quot;&gt;N(\mu, \sigma I)&lt;/script&gt;. Adding &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;noise&lt;/a&gt; to parameters is also known to encourage the agent to explore the environment and escape from local optima. I find that for many tasks where we need an agent to explore, we do not need the entire &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; to be random – just the bias is enough. For challenging locomotion tasks, such as the ones in the &lt;a href=&quot;https://blog.openai.com/roboschool/&quot;&gt;roboschool&lt;/a&gt; environment, I often need to use ES to find a stochastic policy where only the bias parameters are drawn from a normal distribution.&lt;/p&gt;

&lt;h2 id=&quot;evolving-robust-policies-for-bipedal-walker&quot;&gt;Evolving Robust Policies for Bipedal Walker&lt;/h2&gt;


&lt;p&gt;One of the areas where I found ES useful is for searching for robust policies. I want to control the tradeoff between data efficiency, and how robust the policy is over several random trials. To demonstrate this, I tested ES on a nice environment called &lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt; created by &lt;a href=&quot;https://twitter.com/robo_skills&quot;&gt;Oleg Klimov&lt;/a&gt; using the &lt;a href=&quot;https://github.com/pybox2d/pybox2d/blob/master/README.md&quot;&gt;Box2D Physics Engine&lt;/a&gt;, the same physics engine used in &lt;a href=&quot;https://github.com/estevaofon/angry-birds-python/blob/master/README.md&quot;&gt;Angry Birds&lt;/a&gt;.&lt;/p&gt;

&lt;!--&lt;center&gt;
&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;
&lt;i&gt;Our agent solved &lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt;.&lt;/i&gt;&lt;br/&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;--&gt;
&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Evolution Strategy Variant + OpenAI Gym &lt;a href=&quot;https://t.co/t2R0QQ5qcH&quot;&gt;pic.twitter.com/t2R0QQ5qcH&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/889215446150291458?ref_src=twsrc%5Etfw&quot;&gt;July 23, 2017&lt;/a&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;i&gt;Our agent solved &lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt;.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In this environment our agent has to learn a policy to walk across randomly generated terrain within the time limit without falling over. There are 24 inputs, consisting of 10 lidar sensor readings along with various angles and contact indicators. The agent is not given the absolute coordinates of where it is on the map. The action space is 4 continuous values controlling the torques of its 4 motors. The total reward is based on the total distance achieved by the agent. Generally, if the agent completes a map, it will get a score of 300+ points, although a small number of points will be subtracted based on how much motor torque was applied, so energy usage is also a constraint.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt; defines &lt;em&gt;solving&lt;/em&gt; the task as getting an average score of 300+ over 100 consecutive random trials. While it is relatively easy to train an agent to successfully walk across the map using an RL algorithm, it is difficult to get the agent to do so consistently and efficiently, making this task an interesting challenge. To my knowledge, my agent is the only solution known to solve this task so far (as of October 2017).&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;img id=&quot;learning_to_fall_img&quot; src=&quot;/assets/20171109/jpeg/learning_to_fall_img.jpeg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Early stages. Learning to walk.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;&lt;img id=&quot;learning_local_optima_img&quot; src=&quot;/assets/20171109/jpeg/learning_local_optima_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Learns to correct errors, but still slow ...&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Because the terrain map is randomly generated for each trial, sometimes we may end up with an easy terrain, and sometimes a very difficult one. We don’t want our natural selection process to advance agents with weak policies who simply got lucky with an easy map to the next generation, and we want to give agents with good policies a chance to redeem themselves. So what I ended up doing is defining an agent’s episode as 16 random rollouts, and using the average of the cumulative rewards over those 16 rollouts as its fitness score.&lt;/p&gt;

&lt;p&gt;Another way to look at this is to see that even though we are testing the agent over 100 trials, we usually train it on single trials, so the test-task is not the same as the training-task we are optimising for. By averaging each agent in the population multiple times in a stochastic environment, we narrow the gap between our training set and the test set. If we can overfit to the training set, we might as well overfit to the test set, since that’s an &lt;a href=&quot;https://twitter.com/jacobandreas/status/924356906344267776&quot;&gt;okay&lt;/a&gt; thing to do in RL :)&lt;/p&gt;
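&lt;p&gt;Plugged into the earlier search loop, this change is one line: evaluate each candidate with the mean of several rollouts instead of a single one. A sketch (with a toy stand-in for &lt;code class=&quot;highlighter-rouge&quot;&gt;rollout&lt;/code&gt; so the snippet runs on its own):&lt;/p&gt;

```python
import numpy as np

def averaged_fitness(agent, env, rollout_fn, n_trials=16):
    # evaluate one candidate on n_trials randomly generated maps and
    # use the mean cumulative reward as its fitness score
    return np.mean([rollout_fn(agent, env) for _ in range(n_trials)])

# toy stand-in: a noisy score around 300 (the real rollout is defined earlier)
rng = np.random.RandomState(0)
def fake_rollout(agent, env):
    return 300.0 + 50.0 * rng.randn()

fit = averaged_fitness(None, None, fake_rollout)
```

&lt;p&gt;Averaging over 16 trials shrinks the standard deviation of the fitness estimate by a factor of 4, so lucky-map outliers have much less influence on selection.&lt;/p&gt;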

&lt;p&gt;Of course, the data efficiency of our algorithm is now 16x worse, but the final policy is a lot more robust. When I tested the final policy over 100 consecutive random trials, it achieved the average score of over 300 points required to solve this environment. Without this averaging method, the best agent could only obtain an average score of &lt;script type=&quot;math/tex&quot;&gt;\sim&lt;/script&gt; 220 to 230 over 100 trials. To my knowledge, this is the first solution that solves this environment (as of October 2017).&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
&lt;img id=&quot;biped_pepg_final_01_img&quot; src=&quot;/assets/20171109/jpeg/biped_pepg_final_01_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
&lt;img id=&quot;biped_pepg_final_02_img&quot; src=&quot;/assets/20171109/jpeg/biped_pepg_final_02_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;i&gt;Winning solutions evolved using &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;PEPG&lt;/a&gt; using average-of-16 runs per episode.&lt;/i&gt;
&lt;p&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;I also used &lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;PPO&lt;/a&gt;, a state-of-the-art policy gradient algorithm for RL, and tried to tune it to the best of my ability to perform well on this task. In the end, I was only able to get PPO to achieve average scores of &lt;script type=&quot;math/tex&quot;&gt;\sim&lt;/script&gt; 240 to 250 over 100 random trials. But I’m sure someone else will be able to use PPO or another RL algorithm to solve this environment in the future. (Please let me know if you do so!)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update (Jan 2018): &lt;a href=&quot;https://github.com/dgriff777&quot;&gt;dgriff777&lt;/a&gt; was able to use a continuous version of A3C+LSTM with 4 stack frames as the input to train BipedalWalkerHardcore-v2 to obtain a score of 300 over 100 random trials. He provided this awesome implementation of his pytorch model on &lt;a href=&quot;https://github.com/dgriff777/a3c_continuous/blob/master/README.md&quot;&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;The ability to control the tradeoff between data efficiency and policy robustness is quite powerful, and useful in the real world where we need safe policies. In theory, with enough compute, we could even have averaged over the required 100 rollouts and optimised our Bipedal Walker directly to the requirements. Professional engineers are often required to have their designs satisfy specific quality assurance guarantees and meet certain safety factors. We need to be able to take such safety factors into account when we train agents to learn policies that may affect the real world.&lt;/p&gt;

&lt;p&gt;Here are a few other solutions that ES discovered:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;&lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;CMA-ES&lt;/a&gt; solution&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
&lt;img id=&quot;biped_oes_img&quot; src=&quot;/assets/20171109/jpeg/biped_oes_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;&lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;OpenAI-ES&lt;/a&gt; solution&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;I also trained the agent with a stochastic policy network initialised with high noise parameters, so the agent sees noise everywhere, and even its actions are noisy. The agent still learned the task despite not being confident that its inputs and outputs were accurate (though this agent couldn’t get a score of 300+):&lt;/p&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/bipedstoc/biped_noisy.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;biped_noisy_img&quot; src=&quot;/assets/20171109/jpeg/biped_noisy_img.jpeg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;!--&lt;video id=&quot;biped_noisy_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/bipedstoc/biped_noisy.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;&lt;br/&gt;--&gt;
&lt;i&gt;Bipedal walker using a stochastic policy.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;h3 id=&quot;kuka-robot-arm-grasping&quot;&gt;Kuka Robot Arm Grasping&lt;/h3&gt;

&lt;p&gt;I also tried to apply ES with this averaging technique on a simplified Kuka robot arm grasping task. This environment is available in the &lt;a href=&quot;https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/gym/pybullet_envs/bullet&quot;&gt;pybullet environment&lt;/a&gt;. The Kuka model used in the simulation is designed to be similar to a real &lt;a href=&quot;https://www.kuka.com/en-de/products&quot;&gt;Kuka&lt;/a&gt; robot arm. In this simplified task, the agent is given the &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/kukaGymEnv.py#L106&quot;&gt;coordinates&lt;/a&gt; of the object.&lt;/p&gt;

&lt;p&gt;More advanced RL environments may require the agent to infer an action directly from pixel inputs, but we could in principle combine this simplified model with a pre-trained convnet that gives us an estimate of the coordinates as well.&lt;/p&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/kuka/kuka.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;kuka_img&quot; src=&quot;/assets/20171109/jpeg/kuka_img.jpeg&quot; width=&quot;100%&quot; /&gt;
&lt;!--&lt;video id=&quot;kuka_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/kuka/kuka.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
&lt;br /&gt;
&lt;i&gt;Robot arm grasping task using a stochastic policy.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;The agent obtains a score of 10000 if it successfully picks up the object, and 0 otherwise. Some points are deducted for energy usage. By averaging this sparse reward over 16 random trials, we can get ES to optimise for robustness. However, in the end, both the deterministic and stochastic policies I obtained could only pick up the object &lt;script type=&quot;math/tex&quot;&gt;\sim&lt;/script&gt; 70 to 75% of the time. There is still room for improvement.&lt;/p&gt;

&lt;h2 id=&quot;getting-a-minitaur-to-learn-a-multiple-tasks&quot;&gt;Getting a Minitaur to Learn Multiple Tasks&lt;/h2&gt;

&lt;p&gt;Learning to perform multiple difficult tasks at the same time makes us better at performing individual tasks. For example, Shaolin monks who lift weights while standing on a pole will be able to balance better without the weights. Learning not to spill a cup of water while cruising a car at 80mph in the mountains will make the driver a better illegal street racer. We can also train agents on multiple tasks at once to make them learn more stable policies.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/shaolin.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;shaolin_img&quot; src=&quot;/assets/20171109/jpeg/shaolin_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;shaolin_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/shaolin.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Shaolin Agents.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img id=&quot;learning_to_drift_img&quot; src=&quot;/assets/20171109/learning_to_drift.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;learning_to_drift_img&quot; src=&quot;/assets/20171109/jpeg/learning_to_drift_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Learning to drift.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Recent work on &lt;a href=&quot;https://arxiv.org/abs/1710.03748&quot;&gt;self-play&lt;/a&gt; agents demonstrated that agents who learn difficult tasks such as Sumo wrestling (a sport that requires many skills) can also perform easier tasks, like withstanding wind while walking, without any further training. &lt;a href=&quot;https://twitter.com/erwincoumans/status/924352109511819264&quot;&gt;Erwin Coumans&lt;/a&gt; recently experimented with placing a &lt;a href=&quot;https://twitter.com/erwincoumans/status/924352109511819264&quot;&gt;duck&lt;/a&gt; on top of a Minitaur learning to walk forward. If the duck fell off, the Minitaur would also fail the task, so the hope is that this type of task augmentation will help transfer learned policies from simulation over to the real Minitaur. I took one of his &lt;a href=&quot;https://gist.github.com/erwincoumans/c579e076cbaf7c76caa9a42829408e2e&quot;&gt;examples&lt;/a&gt; and experimented with training the Minitaur-and-duck combination using ES.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/minitaur_faster.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;minitaur_faster_img&quot; src=&quot;/assets/20171109/jpeg/minitaur_faster_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;minitaur_faster_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/minitaur_faster.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;CMA-ES walking policy in &lt;a href=&quot;https://pybullet.org&quot;&gt;pybullet&lt;/a&gt;.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/real_minitaur.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;real_minitaur_img&quot; src=&quot;/assets/20171109/jpeg/real_minitaur_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;real_minitaur_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/real_minitaur.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Real Minitaur from &lt;a href=&quot;https://www.ghostrobotics.io/&quot;&gt;Ghost Robotics.&lt;/a&gt;&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The Minitaur model in &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/minitaur.py&quot;&gt;pybullet&lt;/a&gt; is designed to mimic the real physical Minitaur. However, a policy trained in a perfect simulation environment usually fails in the real world. It may not even generalise to small augmentations of the task inside the simulation. For example, the figure above shows a Minitaur trained (using CMA-ES) to walk forward, but this policy is not always able to carry a duck across the room when we place one on its back inside the simulation.&lt;/p&gt;
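&lt;p&gt;One way to frame this kind of task augmentation is as a wrapper around an existing environment that adds a failure condition for the payload. The sketch below uses a toy environment and made-up names, not the actual pybullet Minitaur classes:&lt;/p&gt;

```python
import random

class ToyWalker:
    """Toy stand-in for a walking environment (not the pybullet Minitaur)."""
    def reset(self):
        self.x = 0.0
        return self.x
    def step(self, action):
        self.x += action               # move forward
        reward = action                # reward forward progress
        done = self.x >= 10.0
        return self.x, reward, done, {}

class PayloadWrapper:
    """Task augmentation: the episode also fails if the payload
    (a duck, or a ball) falls off the walker's back."""
    def __init__(self, env, fall_prob=0.05, seed=0):
        self.env = env
        self.fall_prob = fall_prob
        self.rng = random.Random(seed)
    def reset(self):
        return self.env.reset()
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Aggressive motions make the payload more likely to fall off.
        if self.rng.random() < self.fall_prob * abs(action):
            reward, done = 0.0, True   # payload fell: episode fails
            info["payload_fell"] = True
        return obs, reward, done, info

env = PayloadWrapper(ToyWalker())
obs, total, done = env.reset(), 0.0, False
while not done:
    obs, r, done, info = env.step(1.0)
    total += r
```

Training against the wrapped environment forces the policy to trade off speed against the risk of dropping the payload, which is exactly the pressure that encourages more stable gaits.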

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;img id=&quot;duck_notrain_img&quot; src=&quot;/assets/20171109/jpeg/duck_notrain_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_notrain.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
  &lt;!--&lt;video id=&quot;duck_notrain_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_notrain.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Walking policy works with duck.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_normal_small.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
  &lt;video autoplay=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot; loop=&quot;&quot; style=&quot;display: block; margin: auto; width: 100%;&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_normal.mp4&quot; type=&quot;video/mp4&quot; /&gt;&lt;/video&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Policy trained on duck.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The policy learned from the pure walking task still works to some degree even with the duck on board, meaning that the addition of the duck didn’t make the task much harder. The duck has a flat, stable bottom, so keeping it from falling off the Minitaur’s back wasn’t too difficult. I replaced the duck with a ball to make the task much harder.&lt;/p&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/minitaur/ball_cheating.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;ball_cheating_img&quot; src=&quot;/assets/20171109/jpeg/ball_cheating_img.jpeg&quot; width=&quot;100%&quot; /&gt;
&lt;!--&lt;video id=&quot;ball_cheating_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/ball_cheating.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
&lt;br /&gt;
&lt;i&gt;Learning to cheat.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;However, replacing the duck with a ball didn’t immediately result in a stable balancing policy. Instead, CMA-ES found a policy that still technically carried the ball across the floor: the ball first slides into a hole made for the Minitaur’s legs, and the Minitaur then carries the ball inside this hole. The lesson here is that an objective-driven search algorithm will take advantage of any design flaws in the environment and exploit them to reach its objective.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/ball_stoc.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;ball_stoc_img&quot; src=&quot;/assets/20171109/jpeg/ball_stoc_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;ball_stoc_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/ball_stoc.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Stochastic policy trained with ball.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/minitaur/duck_stoc.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;duck_stoc_img&quot; src=&quot;/assets/20171109/jpeg/duck_stoc_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;duck_stoc_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/minitaur/duck_stoc.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Same policy with duck.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;After making the ball smaller, CMA-ES was able to find a stochastic policy that can walk and balance the ball at the same time. This policy also transferred back to the easier duck task. In the future, I hope these types of task augmentation techniques will prove useful for transferring learned policies to real robots.&lt;/p&gt;

&lt;h2 id=&quot;estool&quot;&gt;ESTool&lt;/h2&gt;

&lt;p&gt;One of the big selling points of ES is that it is easy to parallelise the computation across several workers running on different threads, different CPU cores, or even &lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;different machines&lt;/a&gt;. Python’s &lt;a href=&quot;https://docs.python.org/2/library/multiprocessing.html&quot;&gt;multiprocessing&lt;/a&gt; makes it simple to launch parallel processes, but I prefer to use the Message Passing Interface (MPI) with &lt;a href=&quot;https://mpi4py.scipy.org/docs/&quot;&gt;mpi4py&lt;/a&gt; to launch a separate Python process for each job. This gets around the &lt;a href=&quot;https://en.wikipedia.org/wiki/Global_interpreter_lock&quot;&gt;global interpreter lock&lt;/a&gt;, and also gives me confidence that each process has its own sandboxed numpy and gym instance, which is important when it comes to seeding random number generators.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/robo/roboschool.gif&quot; width=&quot;100%&quot;/&gt;--&gt;
&lt;img id=&quot;roboschool_img&quot; src=&quot;/assets/20171109/jpeg/roboschool_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;roboschool_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/roboschool.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Roboschool Hopper, Walker, Ant.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
&lt;img id=&quot;reacher_img&quot; src=&quot;/assets/20171109/jpeg/reacher_img.jpeg&quot; width=&quot;100%&quot; /&gt;
  &lt;!--&lt;video id=&quot;reacher_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/reacher.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;br /&gt;
  &lt;center&gt;&lt;i&gt;Roboschool Reacher.&lt;/i&gt;&lt;/center&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;i&gt;Agents evolved using &lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;&lt;code&gt;estool&lt;/code&gt;&lt;/a&gt; on various &lt;a href=&quot;https://blog.openai.com/roboschool/&quot;&gt;roboschool&lt;/a&gt; tasks.&lt;/i&gt;
&lt;p&gt;&lt;/p&gt;
&lt;/center&gt;

&lt;p&gt;I have implemented a simple tool called &lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt;&lt;/a&gt; that uses the &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/es.py&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt;&lt;/a&gt; library described in the previous &lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;article&lt;/a&gt; to train simple feed-forward policy networks on continuous control RL tasks written with a gym interface. I used &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt; to train all of the experiments described earlier, as well as various other continuous control tasks in gym and roboschool. &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt; uses MPI for distributed processing, so it shouldn’t require too much work to distribute workers over multiple machines.&lt;/p&gt;
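&lt;p&gt;The policies involved can be tiny. As a hedged sketch (not the actual &lt;code&gt;estool&lt;/code&gt; model code), a single-hidden-layer feed-forward policy maps an observation vector to continuous actions, with all weights and biases flattened into one parameter vector, which is exactly the object that ES evolves:&lt;/p&gt;

```python
import numpy as np

def make_policy(obs_dim, hidden_dim, act_dim):
    """Return the parameter count and a function mapping a flat parameter
    vector plus an observation to a continuous action in [-1, 1]."""
    n_params = (obs_dim + 1) * hidden_dim + (hidden_dim + 1) * act_dim
    def policy(params, obs):
        # Slice the flat vector back into weight matrices and bias vectors.
        i = obs_dim * hidden_dim
        w1 = params[:i].reshape(obs_dim, hidden_dim)
        b1 = params[i:i + hidden_dim]
        j = i + hidden_dim
        k = j + hidden_dim * act_dim
        w2 = params[j:k].reshape(hidden_dim, act_dim)
        b2 = params[k:k + act_dim]
        h = np.tanh(obs @ w1 + b1)        # hidden layer
        return np.tanh(h @ w2 + b2)       # actions squashed to [-1, 1]
    return n_params, policy

n_params, policy = make_policy(obs_dim=4, hidden_dim=8, act_dim=2)
action = policy(np.zeros(n_params), np.ones(4))
```

Since ES only ever sees the flat parameter vector, the same optimiser code works unchanged for any network shape.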


&lt;h2 id=&quot;estool-with-pybullet&quot;&gt;ESTool with pybullet&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;GitHub repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the environments that come with gym and roboschool, &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt; works well with most &lt;a href=&quot;https://pybullet.org&quot;&gt;pybullet&lt;/a&gt; gym environments. It is also easy to build custom pybullet environments by modifying existing environments. For example, I was able to make the Minitaur with ball environment (in the &lt;code class=&quot;highlighter-rouge&quot;&gt;custom_envs&lt;/code&gt; directory of the repo) without much effort, and being able to tinker with the environment makes it easier to try out new ideas. If you want to incorporate 3D models from other software packages like &lt;a href=&quot;http://gazebosim.org/tutorials/?tut=ros_urdf&quot;&gt;ROS&lt;/a&gt; or &lt;a href=&quot;https://www.blender-models.com/model-downloads/mechanicalelectronical/robotics/id/star-wars-pit-droid/&quot;&gt;Blender&lt;/a&gt;, you can try building new and interesting pybullet environments and challenge others to try to solve them.&lt;/p&gt;

&lt;p&gt;Many models and environments in pybullet, such as the Kuka robot arm and the Minitaur, are modelled to be similar to the real robot as part of current exciting transfer learning research efforts. In fact, many of these recent &lt;a href=&quot;https://stanfordvl.github.io/ntp/&quot;&gt;cutting&lt;/a&gt; &lt;a href=&quot;https://sites.google.com/view/multi-task-domain-adaptation&quot;&gt;edge&lt;/a&gt; &lt;a href=&quot;https://sermanet.github.io/imitate/&quot;&gt;research&lt;/a&gt; &lt;a href=&quot;https://research.googleblog.com/2017/10/closing-simulation-to-reality-gap-for.html&quot;&gt;papers&lt;/a&gt; are using pybullet to conduct transfer learning experiments.&lt;/p&gt;

&lt;p&gt;You don’t need an expensive Minitaur or Kuka robot arm to play with sim-to-real experiments though. There is a &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/racecar.py&quot;&gt;racecar&lt;/a&gt; model inside pybullet that is modelled after the &lt;a href=&quot;https://mit-racecar.github.io/&quot;&gt;MIT racecar&lt;/a&gt; open source hardware kit. There’s even a pybullet environment that mounts a &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/racecarZEDGymEnv.py&quot;&gt;virtual camera&lt;/a&gt; onto the virtual racecar to give the agent a virtual pixel screen as an input observation.&lt;/p&gt;

&lt;p&gt;Let’s try the easier version first, where the racecar simply needs to learn a policy to move towards a giant ball. In the &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/racecarGymEnv.py&quot;&gt;RacecarBulletEnv-v0&lt;/a&gt; environment, the agent receives the relative coordinates of the ball as its input, and outputs continuous actions that control the motor speed and steering direction. The task is simple enough that it takes only 5 minutes (50 generations) to train on a 2014 Macbook Pro (with an 8-core CPU). Using &lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt;, the command below launches the training job on eight processes, assigning each process 4 jobs for a total of 32 workers, and uses CMA-ES to evolve the policies:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python train.py bullet_racecar -o cma -n 8 -t 4
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The training progress, as well as the model parameters found, will be stored in the &lt;code class=&quot;highlighter-rouge&quot;&gt;log&lt;/code&gt; subdirectory. We can run this command to visualise an agent inside the environment using the best policy found:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python model.py bullet_racecar log/bullet_racecar.cma.1.32.best.json
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;center&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/biped/bipedcover.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;!--&lt;img src=&quot;/assets/20171109/robo/simple_racecar.gif&quot; width=&quot;100%&quot;/&gt;&lt;br/&gt;--&gt;
&lt;img id=&quot;simple_racecar_img&quot; src=&quot;/assets/20171109/jpeg/simple_racecar_img.jpeg&quot; width=&quot;100%&quot; /&gt;
&lt;!--&lt;video id=&quot;simple_racecar_video&quot; autoplay muted playsinline loop width=&quot;100%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/simple_racecar.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
&lt;br /&gt;
&lt;i&gt;pybullet racecar environment, based on the &lt;a href=&quot;https://mit-racecar.github.io/&quot;&gt;MIT Racecar&lt;/a&gt;.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;In the simulation, we can use the mouse cursor to move the ball around, and even move the racecar around if we want to interact with it.&lt;/p&gt;

&lt;p&gt;The IPython notebook &lt;code class=&quot;highlighter-rouge&quot;&gt;plot_training_progress.ipynb&lt;/code&gt; can visualise the per-generation training history of the racecar agents. For each generation, we can see the best score, the worst score, and the average score across the entire population.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171109/svg/bullet_racecar.wallclock.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;/assets/20171109/svg/bullet_racecar.generation.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
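&lt;p&gt;The per-generation statistics shown in these plots are straightforward to compute from the population’s episode scores. A small sketch (the notebook’s actual implementation may differ):&lt;/p&gt;

```python
import numpy as np

def generation_stats(scores):
    """Summarise one generation's fitness scores across the population."""
    scores = np.asarray(scores, dtype=float)
    return {
        "best": float(scores.max()),
        "worst": float(scores.min()),
        "mean": float(scores.mean()),
    }

# e.g. the scores of a population of 32 agents in one generation
stats = generation_stats(np.random.default_rng(0).uniform(0, 100, size=32))
```

Tracking the worst and average scores alongside the best one is useful for ES, since a rising mean indicates the whole population is improving rather than a single lucky candidate.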

&lt;p&gt;Standard locomotion tasks similar to those in roboschool, such as Inverted Pendulum, Hopper, Walker, HalfCheetah, Ant, and Humanoid, are also available in pybullet. Using PEPG with a population size of 256, I found a policy for pybullet’s Ant that reaches a score of 3000 within hours on a multi-core machine:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python train.py bullet_ant -o pepg -n 64 -t 4
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;center&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/robo/bullet_ant_demo.gif&quot; width=&quot;80%&quot;/&gt;--&gt;
&lt;img id=&quot;bullet_ant_demo_img&quot; src=&quot;/assets/20171109/jpeg/bullet_ant_demo_img.jpeg&quot; width=&quot;80%&quot; /&gt;
  &lt;!--&lt;video id=&quot;bullet_ant_demo_video&quot; autoplay muted playsinline loop width=&quot;80%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/bullet_ant_demo.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;/center&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;!--&lt;img src=&quot;/assets/20171109/robo/bullet_ant.gif&quot; width=&quot;120%&quot;/&gt;--&gt;
&lt;img id=&quot;bullet_ant_img&quot; src=&quot;/assets/20171109/jpeg/bullet_ant_img.jpeg&quot; width=&quot;120%&quot; /&gt;
  &lt;!--&lt;video id=&quot;bullet_ant_video&quot; autoplay muted playsinline loop width=&quot;120%&quot;&gt;&lt;source src=&quot;/assets/20171109/robo/bullet_ant.mp4&quot; type=&quot;video/mp4&quot;/&gt;&lt;/video&gt;--&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;i&gt;Example rollout of &lt;a href=&quot;https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_locomotion_envs.py&quot;&gt;AntBulletEnv&lt;/a&gt;. We can still save rollouts as an .mp4 video using &lt;code&gt;gym.wrappers.Monitor&lt;/code&gt;.&lt;/i&gt;
&lt;/center&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171109/svg/bullet_ant.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;In this article, I discussed using ES to find policies for a feed-forward neural network agent performing various continuous control RL tasks defined by a gym environment interface. I described &lt;a href=&quot;https://github.com/hardmaru/estool/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;estool&lt;/code&gt;&lt;/a&gt;, which allowed me to quickly try different ES algorithms with various settings in a distributed processing environment using the MPI framework.&lt;/p&gt;

&lt;p&gt;So far, I have only discussed methods for training an agent by having it learn a policy from trial-and-error in the environment. This form of training from scratch is referred to as &lt;em&gt;model-free&lt;/em&gt; reinforcement learning. In the next article (&lt;em&gt;if I ever get to writing it&lt;/em&gt;), I will discuss &lt;em&gt;model-based&lt;/em&gt; learning, where our agent learns to exploit a previously learned model of the environment to accomplish a given task. And yes, I will still be using evolution.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{ha2017evolving,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;Evolving Stable Strategies&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author&amp;nbsp;&amp;nbsp;= &quot;Ha, David&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal&amp;nbsp;= &quot;blog.otoro.net&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;2017&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;url&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;https://blog.otoro.net/2017/11/12/evolving-stable-strategies/&quot;&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;I want to thank &lt;a href=&quot;https://twitter.com/erwincoumans&quot;&gt;Erwin Coumans&lt;/a&gt; for writing all these great environments, and also for helping me work on making 
&lt;a href=&quot;https://github.com/hardmaru/estool&quot;&gt;ESTool&lt;/a&gt; better. Great research cannot be done without great tools.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171109/biped/biped_cma.gif&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;In the end, it all comes down to choices to turn stumbling blocks into stepping stones.&lt;/i&gt;&lt;br /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&quot;interesting-links&quot;&gt;Interesting Links&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=35VE9WykH1c&quot;&gt;“Fires of a Revolution” Incredible Fast Piano Music (EPIC)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/2017/10/29/visual-evolution-strategies/&quot;&gt;A Visual Guide to Evolution Strategies&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/hardmaru/estool&quot;&gt;ESTool&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://tuvalu.santafe.edu/~erica/stable.pdf&quot;&gt;Stable or Robust? What’s the Difference?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gym.openai.com/docs/&quot;&gt;OpenAI Gym Docs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;Evolution Strategies as a Scalable Alternative to Reinforcement Learning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://edwardlib.org/&quot;&gt;Edward, A library for probabilistic modeling, inference, and criticism&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=FD8l2vPU5FY&quot;&gt;History of Bayesian Neural Networks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gym.openai.com/envs/BipedalWalkerHardcore-v2/&quot;&gt;BipedalWalkerHardcore-v2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.openai.com/roboschool/&quot;&gt;roboschool&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://pybullet.org&quot;&gt;pybullet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1710.03748&quot;&gt;Emergent Complexity via Multi-Agent Competition&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://research.googleblog.com/2017/10/closing-simulation-to-reality-gap-for.html&quot;&gt;GraspGAN&lt;/a&gt;&lt;/p&gt;

&lt;script&gt;
var replace_list=[
[
&quot;learning_to_fall_img&quot;,
&quot;/assets/20171109/biped/learning_to_fall.gif&quot;
],[
&quot;learning_local_optima_img&quot;,
&quot;/assets/20171109/biped/learning_local_optima.gif&quot;
],[
&quot;biped_pepg_final_01_img&quot;,
&quot;/assets/20171109/biped/biped_pepg_final_01.gif&quot;
],[
&quot;biped_pepg_final_02_img&quot;,
&quot;/assets/20171109/biped/biped_pepg_final_02.gif&quot;
],[
&quot;biped_oes_img&quot;,
&quot;/assets/20171109/biped/biped_oes.gif&quot;
],[
&quot;biped_noisy_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/bipedstoc/biped_noisy.gif&quot;
],[
&quot;kuka_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/8a6ccaf5/anim/kuka/kuka.gif&quot;
],[
&quot;shaolin_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/shaolin.gif&quot;
],[
&quot;learning_to_drift_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/learning_to_drift.gif&quot;
],[
&quot;minitaur_faster_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/minitaur_faster.gif&quot;
],[
&quot;real_minitaur_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/real_minitaur.gif&quot;
],[
&quot;duck_notrain_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/duck_notrain.gif&quot;
],[
&quot;ball_cheating_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/ball_cheating.gif&quot;
],[
&quot;ball_stoc_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/ball_stoc.gif&quot;
],[
&quot;duck_stoc_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/8a6ccaf5/anim/minitaur/duck_stoc.gif&quot;
],[
&quot;roboschool_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/8a6ccaf5/anim/robo/roboschool.gif&quot;
],[
&quot;reacher_img&quot;,
&quot;/assets/20171109/robo/reacher.gif&quot;
],[
&quot;simple_racecar_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/robo/simple_racecar.gif&quot;
],[
&quot;bullet_ant_demo_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/robo/bullet_ant_demo.gif&quot;
],[
&quot;bullet_ant_img&quot;,
&quot;https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/robo/bullet_ant.gif&quot;
]];

function replace_jpeg(tagname, newurl, time_delay) {
  setTimeout(function(){
    var img;
    console.log('replacing '+tagname+' with a gif.');
    img = document.getElementById(tagname);
    img.src = newurl;
  }, time_delay*1000);
}

for(var i=0;i&lt;replace_list.length;i++) {
  replace_jpeg(replace_list[i][0], replace_list[i][1], 5+5*i);
}

&lt;/script&gt;

</description>
        <pubDate>Sun, 12 Nov 2017 00:00:00 -0600</pubDate>
      </item>
    
      <item>
        <title>A Visual Guide to Evolution Strategies</title>
        <link>https://blog.otoro.net/2017/10/29/visual-evolution-strategies/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/10/29/visual-evolution-strategies/</guid>
        <description>&lt;center&gt;
&lt;img src=&quot;/assets/20171031/es_bear.jpeg&quot; width=&quot;60%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Survival of the fittest.&lt;/i&gt;
&lt;!--
&lt;p&gt;&lt;/p&gt;
Evolved Bipedal Walker&lt;br/&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/hardmaru/&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;
--&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;In this post I explain how evolution strategies (ES) work with the aid of a few visual examples. I try to keep the equations light, and I provide links to original articles if the reader wishes to understand more details. This is the first post in a series of articles, where I plan to show how to apply these algorithms to a range of tasks from MNIST, OpenAI Gym, Roboschool to PyBullet environments.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Neural network models are highly expressive and flexible, and if we are able to find a suitable set of model parameters, we can use neural nets to solve many challenging problems. Deep learning’s success largely comes from the ability to use the backpropagation algorithm to efficiently calculate the gradient of an objective function over each model parameter. With these gradients, we can efficiently search over the parameter space to find a solution that is often good enough for our neural net to accomplish difficult tasks.&lt;/p&gt;

&lt;p&gt;However, there are many problems where the backpropagation algorithm cannot be used. For example, in reinforcement learning (RL) problems, we can train a neural network to make decisions to perform a sequence of actions that accomplish some task in an environment. However, it is not trivial to estimate the gradient of future reward signals with respect to an action performed by the agent right now, especially if the reward is realised many timesteps in the future. Even if we are able to calculate accurate gradients, there is also the issue of being stuck in a local optimum, of which there are many in RL tasks.&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171031/biped/biped_local_optima.gif&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Stuck in a local optimum.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;A whole area within RL is devoted to studying this credit-assignment problem, and great progress has been made in recent years. However, credit assignment is still difficult when the reward signals are sparse. In the real world, rewards can be sparse and noisy. Sometimes we are given just a single reward, like a bonus check at the end of the year, and depending on our employer, it may be difficult to figure out exactly why it is so low. For these problems, rather than rely on a very noisy and possibly meaningless gradient estimate for updating our policy, we might as well ignore gradient information altogether and attempt to use black-box optimisation techniques such as genetic algorithms (GA) or ES.&lt;/p&gt;

&lt;p&gt;OpenAI published a paper called &lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;Evolution Strategies as a Scalable Alternative to Reinforcement Learning&lt;/a&gt; where they showed that evolution strategies, while being less data efficient than RL, offer many benefits. The ability to abandon gradient calculation allows such algorithms to be evaluated more efficiently. It is also easy to distribute the computation for an ES algorithm to thousands of machines for parallel computation. By running the algorithm from scratch many times, they also showed that policies discovered using ES tend to be more diverse compared to policies discovered by RL algorithms.&lt;/p&gt;

&lt;p&gt;I would like to point out that even the problem of identifying a machine learning model itself, such as designing a neural net’s architecture, is one where we cannot directly compute gradients. While &lt;a href=&quot;https://research.googleblog.com/2017/05/using-machine-learning-to-explore.html&quot;&gt;RL&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1703.00548&quot;&gt;Evolution&lt;/a&gt;, &lt;a href=&quot;https://blog.otoro.net/2016/05/07/backprop-neat/&quot;&gt;GA&lt;/a&gt;, etc., can be applied to search in the space of model architectures, in this post I will focus only on applying these algorithms to search for parameters of a pre-defined model.&lt;/p&gt;

&lt;h2 id=&quot;what-is-an-evolution-strategy&quot;&gt;What is an Evolution Strategy?&lt;/h2&gt;

&lt;center&gt;
&lt;img src=&quot;https://upload.wikimedia.org/wikipedia/commons/8/8b/Rastrigin_function.png&quot; width=&quot;70%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;The two-dimensional Rastrigin function has many local optima. (Source: &lt;a href=&quot;https://en.wikipedia.org/wiki/Test_functions_for_optimization&quot;&gt;Wikipedia&lt;/a&gt;)&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;The diagrams below are top-down plots of &lt;em&gt;shifted&lt;/em&gt; 2D &lt;a href=&quot;https://en.wikipedia.org/wiki/Test_functions_for_optimization&quot;&gt;Schaffer and Rastrigin&lt;/a&gt; functions, two of several simple toy problems used for testing continuous black-box optimisation algorithms. Lighter regions of the plots represent higher values of &lt;script type=&quot;math/tex&quot;&gt;F(x, y)&lt;/script&gt;. As you can see, there are many local optima in these functions. Our job is to find a set of &lt;em&gt;model parameters&lt;/em&gt; &lt;script type=&quot;math/tex&quot;&gt;(x, y)&lt;/script&gt; such that &lt;script type=&quot;math/tex&quot;&gt;F(x, y)&lt;/script&gt; is as close as possible to the global maximum.&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;center&gt;&lt;i&gt;Schaffer-2D Function&lt;/i&gt;&lt;/center&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/schaffer_label.png&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;center&gt;&lt;i&gt;Rastrigin-2D Function&lt;/i&gt;&lt;/center&gt;
  &lt;br /&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/rastrigin_label.png&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Although there are many definitions of evolution strategies, we can define an evolution strategy as an algorithm that provides the user with a set of candidate solutions to evaluate against a problem. The evaluation is based on an &lt;em&gt;objective function&lt;/em&gt; that takes a given solution and returns a single &lt;em&gt;fitness&lt;/em&gt; value. Based on the fitness results of the current solutions, the algorithm then produces the next generation of candidate solutions, which is likely to produce even better results than the current generation. The iterative process stops once the best known solution is satisfactory to the user.&lt;/p&gt;

&lt;p&gt;Given an evolution strategy algorithm called &lt;code class=&quot;highlighter-rouge&quot;&gt;EvolutionStrategy&lt;/code&gt;, we can use it in the following way:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

solver = EvolutionStrategy()

while True:

  # ask the ES to give us a set of candidate solutions
  solutions = solver.ask()

  # create an array to hold the fitness results.
  fitness_list = np.zeros(solver.popsize)

  # evaluate the fitness for each given solution.
  for i in range(solver.popsize):
    fitness_list[i] = evaluate(solutions[i])

  # give list of fitness results back to ES
  solver.tell(fitness_list)

  # get best parameter, fitness from ES
  best_solution, best_fitness = solver.result()

  if best_fitness &amp;gt; MY_REQUIRED_FITNESS:
    break
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Although the size of the population is usually held constant for each generation, it doesn’t need to be. The ES can generate as many candidate solutions as we want, because the solutions produced by an ES are &lt;em&gt;sampled&lt;/em&gt; from a distribution whose parameters are updated by the ES at each generation. I will explain this sampling process with an example of a simple evolution strategy.&lt;/p&gt;

&lt;h2 id=&quot;simple-evolution-strategy&quot;&gt;Simple Evolution Strategy&lt;/h2&gt;

&lt;p&gt;One of the simplest evolution strategies we can imagine simply samples a set of solutions from a Normal distribution, with a mean &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and a fixed standard deviation &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;. In our 2D problem, &lt;script type=&quot;math/tex&quot;&gt;\mu = (\mu_x, \mu_y)&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma = (\sigma_x, \sigma_y)&lt;/script&gt;. Initially, &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; is set at the origin. After the fitness results are evaluated, we set &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; to the best solution in the population, and sample the next generation of solutions around this new mean. This is how the algorithm behaves over 20 generations on the two problems mentioned earlier:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/simplees.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/simplees.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;In the visualisation above, the green dot indicates the mean of the distribution at each generation, the blue dots are the sampled solutions, and the red dot is the best solution found so far by our algorithm.&lt;/p&gt;

&lt;p&gt;This simple algorithm will generally only work for simple problems. Given its greedy nature, it throws away all but the best solution, so it is prone to getting stuck at a local optimum on more complicated problems. It would be beneficial to sample the next generation from a probability distribution that represents a more diverse set of ideas, rather than just from the best solution of the current generation.&lt;/p&gt;
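&lt;p&gt;As a minimal sketch of this simple evolution strategy (assuming a fitness function &lt;code&gt;F&lt;/code&gt; that we want to maximise; the function name and default values here are my own, not from any library):&lt;/p&gt;

```python
import numpy as np

def simple_es(F, num_generations=20, popsize=64, sigma=0.1):
    """Greedy ES: sample around the best solution of each generation."""
    mu = np.zeros(2)  # start the mean at the origin
    best_solution, best_fitness = mu.copy(), -np.inf
    for g in range(num_generations):
        # sample a population around the current mean, with fixed sigma
        solutions = mu + sigma * np.random.randn(popsize, 2)
        fitness = np.array([F(s) for s in solutions])
        # greedily move the mean to the best solution of this generation
        mu = solutions[np.argmax(fitness)]
        if fitness.max() > best_fitness:
            best_solution, best_fitness = mu.copy(), fitness.max()
    return best_solution, best_fitness
```

&lt;p&gt;On a smooth unimodal objective this converges quickly; on a function like Rastrigin, the same greediness is exactly what gets it stuck in a local optimum.&lt;/p&gt;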

&lt;h2 id=&quot;simple-genetic-algorithm&quot;&gt;Simple Genetic Algorithm&lt;/h2&gt;

&lt;p&gt;One of the oldest black-box optimisation algorithms is the genetic algorithm. There are many variations with many degrees of sophistication, but I will illustrate the simplest version here.&lt;/p&gt;

&lt;p&gt;The idea is quite simple: keep only the best-performing 10% of solutions in the current generation, and let the rest of the population die. In the next generation, a new solution is sampled by randomly selecting two solutions from the survivors of the previous generation and recombining their parameters. This &lt;em&gt;crossover&lt;/em&gt; recombination process uses a coin toss to determine which parent to take each parameter from. In the case of our 2D toy function, our new solution might inherit &lt;script type=&quot;math/tex&quot;&gt;x&lt;/script&gt; or &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt; from either parent with 50% chance. Gaussian noise with a fixed standard deviation is also injected into each new solution after this recombination process.&lt;/p&gt;
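&lt;p&gt;A hedged sketch of this selection, crossover, and mutation procedure (the function name and parameter defaults are my own):&lt;/p&gt;

```python
import numpy as np

def ga_next_generation(solutions, fitness, elite_frac=0.1, sigma=0.1):
    """Produce the next GA population via elite selection, crossover, mutation."""
    popsize, dim = solutions.shape
    n_elite = max(2, int(popsize * elite_frac))
    # keep only the best-performing solutions; the rest of the population dies
    elite = solutions[np.argsort(fitness)[-n_elite:]]
    children = np.empty_like(solutions)
    for i in range(popsize):
        # pick two random parents from the survivors
        a, b = elite[np.random.choice(n_elite, size=2, replace=False)]
        # coin-toss crossover: inherit each parameter from either parent
        coin = np.random.rand(dim) > 0.5
        children[i] = np.where(coin, a, b)
    # inject gaussian noise with a fixed standard deviation into each child
    return children + sigma * np.random.randn(popsize, dim)
```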

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/simplega.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/simplega.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;The figure above illustrates how the simple genetic algorithm works. The green dots represent members of the elite population from the previous generation, the blue dots are the offspring that form the set of candidate solutions, and the red dot is the best solution.&lt;/p&gt;

&lt;p&gt;Genetic algorithms maintain diversity by keeping track of a diverse set of candidate solutions to reproduce the next generation. In practice, however, most of the solutions in the elite surviving population tend to converge to a local optimum over time. There are more sophisticated variations of GA out there, such as &lt;a href=&quot;http://people.idsia.ch/~juergen/gomez08a.pdf&quot;&gt;CoSyNe&lt;/a&gt;, &lt;a href=&quot;https://blog.otoro.net/2015/03/10/esp-algorithm-for-double-pendulum/&quot;&gt;ESP&lt;/a&gt;, and &lt;a href=&quot;https://blog.otoro.net/2016/05/07/backprop-neat/&quot;&gt;NEAT&lt;/a&gt;, where the idea is to cluster similar solutions in the population together into different species, to maintain better diversity over time.&lt;/p&gt;

&lt;h2 id=&quot;covariance-matrix-adaptation-evolution-strategy-cma-es&quot;&gt;Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)&lt;/h2&gt;

&lt;p&gt;A shortcoming of both the Simple ES and Simple GA is that the standard deviation of our noise parameter is fixed. There are times when we want to explore more and increase the standard deviation of our search space, and there are times when we are confident we are close to a good optimum and just want to fine-tune the solution. We basically want our search process to behave like this:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/cmaes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/cmaes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Amazing, isn’t it? The search process shown in the figure above is produced by &lt;a href=&quot;https://en.wikipedia.org/wiki/CMA-ES&quot;&gt;Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)&lt;/a&gt;. CMA-ES is an algorithm that can take the results of each generation, and adaptively increase or decrease the search space for the next generation. It not only adapts the mean &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and sigma &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; parameters, but also calculates the entire covariance matrix of the parameter space. At each generation, CMA-ES provides the parameters of a multi-variate normal distribution to sample solutions from. So how does it know how to increase or decrease the search space?&lt;/p&gt;

&lt;p&gt;Before we discuss its methodology, let’s review how to estimate a &lt;a href=&quot;https://en.wikipedia.org/wiki/Covariance_matrix&quot;&gt;covariance matrix&lt;/a&gt;. This will be important for understanding CMA-ES’s methodology later on. If we want to estimate the covariance matrix of our entire sampled population of size &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt;, we can do so using the set of equations below to calculate the maximum likelihood estimate of a covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C&lt;/script&gt;. We first calculate the means of each of the &lt;script type=&quot;math/tex&quot;&gt;x_i&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;y_i&lt;/script&gt; in our population:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_x = \frac{1}{N} \sum_{i=1}^{N}x_i,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_y = \frac{1}{N} \sum_{i=1}^{N}y_i.&lt;/script&gt;

&lt;p&gt;The terms of the 2x2 covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C&lt;/script&gt; will be:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N}(x_i - \mu_x)^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_y^2 = \frac{1}{N} \sum_{i=1}^{N}(y_i - \mu_y)^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y).&lt;/script&gt;

&lt;p&gt;Of course, these resulting mean estimates &lt;script type=&quot;math/tex&quot;&gt;\mu_x&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\mu_y&lt;/script&gt;, and covariance terms &lt;script type=&quot;math/tex&quot;&gt;\sigma_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_{xy}&lt;/script&gt; are just estimates of the actual covariance matrix that we originally sampled from, and are not, by themselves, particularly useful to us.&lt;/p&gt;
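&lt;p&gt;These maximum likelihood estimates are easy to verify numerically. Note that &lt;code&gt;np.cov&lt;/code&gt; divides by N-1 by default, so &lt;code&gt;bias=True&lt;/code&gt; is needed to match the 1/N formulas above:&lt;/p&gt;

```python
import numpy as np

np.random.seed(0)
x, y = np.random.randn(2, 1000)  # N = 1000 samples of each coordinate

# maximum likelihood estimates, dividing by N as in the formulas above
mu_x, mu_y = x.mean(), y.mean()
var_x = np.mean((x - mu_x) ** 2)
var_y = np.mean((y - mu_y) ** 2)
cov_xy = np.mean((x - mu_x) * (y - mu_y))

C = np.array([[var_x, cov_xy],
              [cov_xy, var_y]])

# bias=True makes np.cov divide by N instead of N - 1
assert np.allclose(C, np.cov(x, y, bias=True))
```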

&lt;p&gt;CMA-ES modifies the above covariance calculation formula in a clever way to make it adapt well to an optimisation problem. I will go over how it does this step by step. Firstly, it focuses on the best &lt;script type=&quot;math/tex&quot;&gt;N_{best}&lt;/script&gt; solutions in the current generation. For simplicity, let’s set &lt;script type=&quot;math/tex&quot;&gt;N_{best}&lt;/script&gt; to be the best 25% of solutions. After sorting the solutions based on fitness, we calculate the mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g+1)}&lt;/script&gt; of the next generation &lt;script type=&quot;math/tex&quot;&gt;(g+1)&lt;/script&gt; as the average of only the best 25% of the solutions in the current population &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt;, i.e.:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_x^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}x_i,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\mu_y^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}y_i.&lt;/script&gt;

&lt;p&gt;Next, we use only the best 25% of the solutions to estimate the covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C^{(g+1)}&lt;/script&gt; of the next generation. The clever &lt;em&gt;hack&lt;/em&gt; here is to use the &lt;em&gt;current&lt;/em&gt; generation’s mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g)}&lt;/script&gt; in the calculation, rather than the updated &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g+1)}&lt;/script&gt; parameters we just calculated:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_x^{2, (g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}(x_i - \mu_x^{(g)})^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_y^{2, (g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}(y_i - \mu_y^{(g)})^2,&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\sigma_{xy}^{(g+1)} = \frac{1}{N_{best}} \sum_{i=1}^{N_{best}}(x_i - \mu_x^{(g)})(y_i - \mu_y^{(g)}).&lt;/script&gt;

&lt;p&gt;Armed with a set of &lt;script type=&quot;math/tex&quot;&gt;\mu_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\mu_y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_y&lt;/script&gt;, and &lt;script type=&quot;math/tex&quot;&gt;\sigma_{xy}&lt;/script&gt; parameters for the next generation &lt;script type=&quot;math/tex&quot;&gt;(g+1)&lt;/script&gt;, we can now sample the next generation of candidate solutions.&lt;/p&gt;

&lt;p&gt;Below is a set of figures to visually illustrate how it uses the results from the current generation &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt; to construct the solutions in the next generation &lt;script type=&quot;math/tex&quot;&gt;(g+1)&lt;/script&gt;:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step1.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 1&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step2.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 2&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step3.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 3&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
  &lt;td&gt;&lt;img src=&quot;/assets/20171031/rastrigin/cmaes_step4.png&quot; width=&quot;100%&quot; /&gt;&lt;center&gt;&lt;i&gt;Step 4&lt;/i&gt;&lt;/center&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;ol&gt;
  &lt;li&gt;Calculate the fitness score of each candidate solution in generation &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;Isolate the best 25% of the population in generation &lt;script type=&quot;math/tex&quot;&gt;(g)&lt;/script&gt;, in purple.&lt;/li&gt;
  &lt;li&gt;Using only the best solutions, along with the mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g)}&lt;/script&gt; of the current generation (the green dot), calculate the covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C^{(g+1)}&lt;/script&gt; of the next generation.&lt;/li&gt;
  &lt;li&gt;Sample a new set of candidate solutions using the updated mean &lt;script type=&quot;math/tex&quot;&gt;\mu^{(g+1)}&lt;/script&gt; and covariance matrix &lt;script type=&quot;math/tex&quot;&gt;C^{(g+1)}&lt;/script&gt;.&lt;/li&gt;
&lt;/ol&gt;
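&lt;p&gt;The four steps above can be sketched in a few lines of NumPy. This is only the simplified update described in this section, not the full CMA-ES, which also uses evolution paths, step-size control, and rank-one updates:&lt;/p&gt;

```python
import numpy as np

def cma_es_style_update(solutions, fitness, mu_old, best_frac=0.25):
    """Simplified CMA-ES-style update: new mean from the best 25%,
    covariance measured around the current mean mu_old."""
    n_best = int(len(solutions) * best_frac)
    elite = solutions[np.argsort(fitness)[-n_best:]]  # best 25% (step 2)
    mu_new = elite.mean(axis=0)                       # mu^(g+1)
    centered = elite - mu_old                         # old mean mu^(g) (step 3)
    C_new = centered.T @ centered / n_best            # C^(g+1)
    return mu_new, C_new

def sample_next_generation(mu_new, C_new, popsize):
    # step 4: sample candidates from the updated multivariate normal
    return np.random.multivariate_normal(mu_new, C_new, size=popsize)
```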

&lt;p&gt;Let’s visualise the scheme one more time, on the entire search process on both problems:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/cmaes2.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/cmaes2.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Because CMA-ES can adapt both its mean and covariance matrix using information from the best solutions, it can decide to cast a wider net when the best solutions are far away, or narrow the search space when the best solutions are close by.  My description of the CMA-ES algorithm for a 2D toy problem is highly simplified to get the idea across. For more details, I suggest reading the &lt;a href=&quot;https://arxiv.org/abs/1604.00772&quot;&gt;CMA-ES Tutorial&lt;/a&gt; prepared by Nikolaus Hansen, the author of CMA-ES.&lt;/p&gt;

&lt;p&gt;This algorithm is one of the most popular gradient-free optimisation algorithms out there, and has been the algorithm of choice for many researchers and practitioners alike. The only real drawback is its performance when the number of model parameters we need to solve for is large, as the covariance calculation is &lt;script type=&quot;math/tex&quot;&gt;O(N^2)&lt;/script&gt;, although recently there have been approximations that make it &lt;script type=&quot;math/tex&quot;&gt;O(N)&lt;/script&gt;. CMA-ES is my algorithm of choice when the search space is less than a thousand parameters. I find it still usable up to ~ 10K parameters if I’m willing to be patient.&lt;/p&gt;

&lt;h2 id=&quot;natural-evolution-strategies&quot;&gt;Natural Evolution Strategies&lt;/h2&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Imagine if you had built an artificial life simulator, and you sample a different neural network to control the behavior of each ant inside an ant colony. Using the Simple Evolution Strategy for this task will optimise for traits and behaviours that benefit individual ants, and with each successive generation, our population will be full of alpha ants who only care about their own well-being.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Instead of using a rule that is based on the survival of the fittest ants, what if you take an alternative approach where you take the sum of all fitness values of the entire ant population, and optimise for this sum instead to maximise the well-being of the entire ant population over successive generations? Well, you would end up creating a Marxist utopia.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;A perceived weakness of the algorithms mentioned so far is that they discard the majority of the solutions and only keep the best solutions. Weak solutions contain information about what &lt;em&gt;not&lt;/em&gt; to do, and this is valuable information to calculate a better estimate for the next generation.&lt;/p&gt;

&lt;p&gt;Many people who studied RL are familiar with the &lt;a href=&quot;http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf&quot;&gt;REINFORCE&lt;/a&gt; paper. In this 1992 paper, Williams outlined an approach to estimate the gradient of the expected rewards with respect to the model parameters of a policy neural network. This paper also proposed using REINFORCE as an Evolution Strategy, in Section 6 of the paper. This special case of &lt;em&gt;REINFORCE-ES&lt;/em&gt; was expanded later on in &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A64D1AE8313A364B814998E9E245B40A?doi=10.1.1.180.7104&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Parameter-Exploring Policy Gradients&lt;/a&gt; (PEPG, 2009) and &lt;a href=&quot;https://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf&quot;&gt;Natural Evolution Strategies&lt;/a&gt; (NES, 2014).&lt;/p&gt;

&lt;p&gt;In this approach, we want to use all of the information from each member of the population, good or bad, for estimating a gradient signal that can move the entire population to a better direction in the next generation. Since we are estimating a gradient, we can also use this gradient in a standard SGD update rule typically used for deep learning. We can even use this estimated gradient with Momentum SGD, RMSProp, or Adam if we want to.&lt;/p&gt;

&lt;p&gt;The idea is to maximise the &lt;em&gt;expected value&lt;/em&gt; of the fitness score of a sampled solution. If the expected result is good enough, then the best performing member within a sampled population will be even better, so optimising for the expectation might be a sensible approach. Maximising the expected fitness score of a sampled solution is almost the same as maximising the total fitness score of the entire population.&lt;/p&gt;

&lt;p&gt;If &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt; is a solution vector sampled from a probability distribution function &lt;script type=&quot;math/tex&quot;&gt;\pi(z, \theta)&lt;/script&gt;, we can define the expected value of the objective function &lt;script type=&quot;math/tex&quot;&gt;F&lt;/script&gt; as:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;J(\theta) = E_{\theta}[F(z)] = \int F(z) \; \pi(z, \theta) \; dz,&lt;/script&gt;

&lt;p&gt;where &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; are the parameters of the probability distribution function. For example, if &lt;script type=&quot;math/tex&quot;&gt;\pi&lt;/script&gt; is a normal distribution, then &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; would be &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;. For our simple 2D toy problems, each sampled solution &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt; is a 2D vector &lt;script type=&quot;math/tex&quot;&gt;(x, y)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf&quot;&gt;NES paper&lt;/a&gt; contains a nice derivation of the gradient of &lt;script type=&quot;math/tex&quot;&gt;J(\theta)&lt;/script&gt; with respect to &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt;. Using the same &lt;em&gt;log-likelihood trick&lt;/em&gt; as in the REINFORCE algorithm allows us to calculate the gradient of &lt;script type=&quot;math/tex&quot;&gt;J(\theta)&lt;/script&gt;:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\theta} J(\theta) = E_{\theta}[ \; F(z)  \; \nabla_{\theta} \log \pi(z, \theta) \; ].&lt;/script&gt;

&lt;p&gt;In a population size of &lt;script type=&quot;math/tex&quot;&gt;N&lt;/script&gt;, where we have solutions &lt;script type=&quot;math/tex&quot;&gt;z^1&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;z^2&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;...&lt;/script&gt; &lt;script type=&quot;math/tex&quot;&gt;z^N&lt;/script&gt;, we can estimate this gradient as a summation:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \; F(z^i)  \; \nabla_{\theta} \log \pi(z^i, \theta).&lt;/script&gt;

&lt;p&gt;With this gradient &lt;script type=&quot;math/tex&quot;&gt;\nabla_{\theta} J(\theta)&lt;/script&gt;, we can use a learning rate parameter &lt;script type=&quot;math/tex&quot;&gt;\alpha&lt;/script&gt; (such as 0.01) and start optimising the &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; parameters of pdf &lt;script type=&quot;math/tex&quot;&gt;\pi&lt;/script&gt; so that our sampled solutions will likely get higher fitness scores on the objective function &lt;script type=&quot;math/tex&quot;&gt;F&lt;/script&gt;. Using SGD (or Adam), we can update &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; for the next generation:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\theta \rightarrow \theta + \alpha \nabla_{\theta} J(\theta),&lt;/script&gt;

&lt;p&gt;and sample a new set of candidate solutions &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt; from this updated pdf, and continue until we arrive at a satisfactory solution.&lt;/p&gt;

&lt;p&gt;In Section 6 of the &lt;a href=&quot;https://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf&quot;&gt;REINFORCE&lt;/a&gt; paper, Williams derived closed-form formulas of the gradient &lt;script type=&quot;math/tex&quot;&gt;\nabla_{\theta} \log \pi(z^i, \theta)&lt;/script&gt;, for the special case where &lt;script type=&quot;math/tex&quot;&gt;\pi(z, \theta)&lt;/script&gt; is a factored multi-variate normal distribution (i.e., the correlation parameters are zero). In this special case, &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; are the &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; vectors. Therefore, each element of a solution can be sampled from a univariate normal distribution &lt;script type=&quot;math/tex&quot;&gt;z_j \sim N(\mu_j, \sigma_j)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The closed-form formulas for &lt;script type=&quot;math/tex&quot;&gt;\nabla_{\theta} \log N(z^i, \theta)&lt;/script&gt;, for each individual element of vector &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt; on each solution &lt;script type=&quot;math/tex&quot;&gt;i&lt;/script&gt; in the population can be derived as:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\mu_{j}} \log N(z^i, \mu, \sigma) = \frac{z_j^i - \mu_j}{\sigma_j^2},&lt;/script&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\sigma_{j}} \log N(z^i, \mu, \sigma) = \frac{(z_j^i - \mu_j)^2 - \sigma_j^2}{\sigma_j^3}.&lt;/script&gt;

&lt;p&gt;For clarity, I use the index &lt;script type=&quot;math/tex&quot;&gt;j&lt;/script&gt; to count across the parameter space, and this is not to be confused with the superscript &lt;script type=&quot;math/tex&quot;&gt;i&lt;/script&gt;, used to count across each sampled member of the population. For our 2D problems, &lt;script type=&quot;math/tex&quot;&gt;z_1 = x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;z_2 = y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\mu_1 = \mu_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\mu_2 = \mu_y&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_1 = \sigma_x&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;\sigma_2 = \sigma_y&lt;/script&gt; in this context.&lt;/p&gt;
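&lt;p&gt;Plugging these closed-form gradients into the approximate gradient formula, one generation of the update can be sketched as follows. This is a bare-bones version without the baseline and antithetic sampling that PEPG adds, and the function name and defaults are my own:&lt;/p&gt;

```python
import numpy as np

def reinforce_es_step(mu, sigma, F, popsize=500, alpha=0.01):
    """One REINFORCE-ES generation for a factored gaussian distribution."""
    # sample solutions z^i ~ N(mu, sigma), elementwise
    z = mu + sigma * np.random.randn(popsize, len(mu))
    fitness = np.array([F(zi) for zi in z])
    # closed-form log-derivatives, per parameter j and per sampled solution i
    grad_mu = (z - mu) / sigma ** 2
    grad_sigma = ((z - mu) ** 2 - sigma ** 2) / sigma ** 3
    # monte-carlo estimate of grad J(theta), averaged over the population
    d_mu = np.mean(fitness[:, None] * grad_mu, axis=0)
    d_sigma = np.mean(fitness[:, None] * grad_sigma, axis=0)
    # plain SGD ascent on theta = (mu, sigma); keep sigma positive
    return mu + alpha * d_mu, np.maximum(sigma + alpha * d_sigma, 1e-3)
```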

&lt;p&gt;These two formulas can be plugged back into the approximate gradient formula to derive explicit update rules for &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;. In the papers mentioned above, they derived more explicit update rules, incorporated a &lt;em&gt;baseline&lt;/em&gt;, and introduced other tricks such as antithetic sampling in PEPG, which is what my implementation is based on. NES proposed incorporating the inverse of the Fisher Information Matrix into the gradient update rule. But the concept is basically the same as other ES algorithms, where we update the mean and standard deviation of a multi-variate normal distribution at each new generation, and sample a new set of solutions from the updated distribution. Below is a visualization of this algorithm in action, following the formulas described above:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/pepg.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/pepg.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;We see that this algorithm is able to dynamically change the &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;’s to explore or fine-tune the solution space as needed. Unlike CMA-ES, there is no correlation structure in our implementation, so we don’t get the diagonal ellipse samples, only the vertical or horizontal ones, although in principle we could derive update rules to incorporate the entire covariance matrix if we needed to, at the expense of computational efficiency.&lt;/p&gt;

&lt;p&gt;I like this algorithm because, like CMA-ES, the &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;’s can adapt so our search space can be expanded or narrowed over time. Because the correlation parameter is not used in this implementation, the efficiency of the algorithm is &lt;script type=&quot;math/tex&quot;&gt;O(N)&lt;/script&gt;, so I use PEPG when the performance of CMA-ES becomes an issue, typically when the number of model parameters exceeds several thousand.&lt;/p&gt;

&lt;h2 id=&quot;openai-evolution-strategy&quot;&gt;OpenAI Evolution Strategy&lt;/h2&gt;

&lt;p&gt;In OpenAI’s &lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;paper&lt;/a&gt;, they implement an evolution strategy that is a special case of the REINFORCE-ES algorithm outlined earlier. In particular, &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; is fixed to a constant number, and only the &lt;script type=&quot;math/tex&quot;&gt;\mu&lt;/script&gt; parameter is updated at each generation. Below is what this strategy looks like, with a constant &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; parameter:&lt;/p&gt;

&lt;center&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;tr&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/schaffer/openes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
  &lt;td&gt;
  &lt;img src=&quot;/assets/20171031/rastrigin/oes.gif&quot; width=&quot;100%&quot; /&gt;
  &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;In addition to this simplification, the paper also proposed a modification of the update rule that is suitable for parallel computation across different worker machines. In their update rule, a large grid of random numbers is pre-computed using a fixed seed. By doing this, each worker can reproduce the parameters of every other worker over time, and each worker needs only to communicate a single number, its final fitness result, to all of the other workers. This is important if we want to scale evolution strategies to thousands or even a million workers located on different machines: while it may not be feasible to transmit an entire solution vector a million times at each generation update, it may be feasible to transmit only the final fitness results. In the paper, they showed that by using 1440 workers on Amazon EC2, they were able to solve the MuJoCo Humanoid walking task in ~ 10 minutes.&lt;/p&gt;
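&lt;p&gt;A hedged sketch of these two ideas together, the fixed-sigma update and the shared-seed trick (the function and variable names here are my own, not OpenAI’s code): because every worker knows every other worker’s random seed, it can regenerate each perturbation locally, and only the scalar fitness values ever need to travel over the network.&lt;/p&gt;

```python
import numpy as np

def openai_es_update(mu, sigma, alpha, seeds, fitness_list):
    """Fixed-sigma ES update: mu += alpha / (N * sigma) * sum_i F_i * eps_i.
    Each eps_i is regenerated from worker i's known random seed, so workers
    only ever communicate their scalar fitness results."""
    grad = np.zeros_like(mu)
    for seed, fitness in zip(seeds, fitness_list):
        eps = np.random.RandomState(seed).randn(len(mu))  # worker i's noise
        grad += fitness * eps
    grad /= len(seeds) * sigma
    return mu + alpha * grad
```

&lt;p&gt;Since the update is deterministic given the seeds and fitness values, every worker applies the identical update and all copies of the parameters stay in sync without ever transmitting a parameter vector.&lt;/p&gt;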

&lt;p&gt;I think that, in principle, this parallel update rule should also work with the original algorithm that adapts &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt;, but perhaps in practice they wanted to keep the number of moving parts to a minimum for their large-scale parallel computing experiments. This inspiring paper also discusses many other practical aspects of deploying ES for RL-style tasks, and I highly recommend going through it to learn more.&lt;/p&gt;

&lt;h2 id=&quot;fitness-shaping&quot;&gt;Fitness Shaping&lt;/h2&gt;

&lt;p&gt;Most of the algorithms above are usually combined with a &lt;em&gt;fitness shaping&lt;/em&gt; method, such as the rank-based fitness shaping method I will discuss here. Fitness shaping prevents outliers in the population from dominating the approximate gradient calculation mentioned earlier:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \; F(z^i)  \; \nabla_{\theta} \log \pi(z^i, \theta).&lt;/script&gt;

&lt;p&gt;If a particular &lt;script type=&quot;math/tex&quot;&gt;F(z^m)&lt;/script&gt; is much larger than the other &lt;script type=&quot;math/tex&quot;&gt;F(z^i)&lt;/script&gt; in the population, then the gradient might become dominated by this outlier, increasing the chance of the algorithm getting stuck in a local optimum. To mitigate this, one can apply a rank transformation of the fitness. Rather than use the actual fitness function, we rank the results and use an augmented fitness function that is proportional to the solution’s rank in the population. Below is a comparison of what the original set of fitness values may look like, and what the ranked fitness values look like:&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/assets/20171031/ranked_fitness.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;What this means is: suppose we have a population size of 101. We evaluate each solution with the actual fitness function, and then sort the solutions by their fitness. We assign an augmented fitness value of -0.50 to the worst performer, -0.49 to the second-worst solution, …, 0.49 to the second-best solution, and finally a fitness value of 0.50 to the best solution. This augmented set of fitness values is used to calculate the gradient update, instead of the actual fitness values. In a way, it is similar to applying Batch Normalization to the results, but more direct. There are alternative methods for fitness shaping, but they all give similar results in the end.&lt;/p&gt;
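&lt;p&gt;The rank transformation described above takes only a few lines. In this sketch, raw fitness values are replaced by evenly spaced values between -0.50 and 0.50 according to rank, so a huge outlier contributes 0.50 like any other best-of-population solution would, regardless of its magnitude:&lt;/p&gt;

```python
import numpy as np

def centered_ranks(fitness):
    # Map raw fitness values to evenly spaced values in [-0.5, 0.5]:
    # the worst solution gets -0.5, the best gets 0.5.
    n = len(fitness)
    ranks = np.empty(n)
    ranks[np.argsort(fitness)] = np.arange(n)  # 0 = worst, n-1 = best
    return ranks / (n - 1) - 0.5

raw = np.array([1.0, -3.0, 1000.0, 2.0, 0.5])  # one huge outlier
shaped = centered_ranks(raw)
# shaped is [0.0, -0.5, 0.5, 0.25, -0.25]: the outlier's magnitude is gone
```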

&lt;p&gt;I find fitness shaping to be very useful for RL tasks where the objective function is non-deterministic for a given policy network, which is often the case in RL environments where maps are randomly generated and opponents have random policies. It is less useful for optimising well-behaved functions that are deterministic, and using fitness shaping can sometimes slow down the time it takes to find a good solution.&lt;/p&gt;

&lt;h2 id=&quot;mnist&quot;&gt;MNIST&lt;/h2&gt;

&lt;p&gt;Although ES might be a way to search for more novel solutions that are difficult for gradient-based methods to find, it still vastly underperforms gradient-based methods on many problems where we can calculate high quality gradients. For instance, only an idiot would attempt to use a genetic algorithm for image classification. But sometimes &lt;a href=&quot;https://blog.openai.com/nonlinear-computation-in-linear-networks/&quot;&gt;such people&lt;/a&gt; do exist in the world, and sometimes these explorations can be fruitful!&lt;/p&gt;

&lt;p&gt;Since all ML algorithms should be tested on MNIST, I also tried to apply these various ES algorithms to find weights for a small, simple 2-layer convnet used to classify MNIST, just to see where we stand compared to SGD. The convnet only has ~ 11k parameters so we can accommodate the slower CMA-ES algorithm. The code and the experiments are available &lt;a href=&quot;https://github.com/hardmaru/pytorch_notebooks/tree/master/mnist_es&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
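&lt;p&gt;The main piece of glue needed for this experiment is mapping each flat solution vector from the ES solver back into the network’s weight tensors before evaluation. The shapes below correspond to the 2-layer convnet described above (two 5x5 conv layers with 8 and 16 filters, and a final 10-way linear layer), but the helper itself is an illustrative sketch, not the code from the repository:&lt;/p&gt;

```python
import numpy as np

# Weight shapes for the small convnet: conv1 and conv2 (5x5 kernels,
# with biases), then the linear layer over 16 pooled 7x7 feature maps.
SHAPES = [(8, 1, 5, 5), (8,), (16, 8, 5, 5), (16,), (10, 16 * 7 * 7), (10,)]
NPARAMS = sum(int(np.prod(s)) for s in SHAPES)  # 11274, i.e. ~11k parameters

def unflatten(flat, shapes=SHAPES):
    # Slice one flat ES solution vector into the individual weight tensors.
    weights, idx = [], 0
    for s in shapes:
        n = int(np.prod(s))
        weights.append(flat[idx:idx + n].reshape(s))
        idx += n
    assert idx == flat.size  # the solver's vector must match the model
    return weights
```

Each candidate solution returned by the solver’s ask step would be passed through a helper like this and loaded into the model before measuring its classification accuracy.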

&lt;!--
______
&lt;code&gt;class Net(nn.Module):&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;def __init__(self):&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;super(Net, self).__init__()&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter1 = 8&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter2 = 16&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_padding = 2&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.filter_size = 5&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# input is 28x28&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.conv1 = nn.Conv2d(1,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter1,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.filter_size,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;padding=self.num_padding)&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# feature map size is 14*14 by pooling&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.conv2 = nn.Conv2d(self.num_filter1,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.num_filter2,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.filter_size,&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;padding=self.num_padding)&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# feature map size is 7*7 by pooling&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;self.fc = nn.Linear(self.num_filter2*7*7, 10)&lt;/code&gt;

&lt;code&gt;&amp;nbsp;&amp;nbsp;def forward(self, x):&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = F.max_pool2d(F.relu(self.conv1(x)), 2)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = F.max_pool2d(F.relu(self.conv2(x)), 2)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = x.view(-1, self.num_filter2*7*7)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x = self.fc(x)&lt;/code&gt;&lt;br/&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return F.log_softmax(x)&lt;/code&gt;

______
--&gt;
&lt;p&gt;Below are the results for various ES methods, using a population size of 101, over 300 epochs. We keep track of the model parameters that performed best on the entire training set at the end of each epoch, and evaluate this model once on the test set after 300 epochs. It is interesting that the test accuracy is sometimes higher than the training accuracy for the models with lower scores.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Method&lt;/th&gt;
      &lt;th&gt;Train Set&lt;/th&gt;
      &lt;th&gt;Test Set&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Adam (BackProp) Baseline&lt;/td&gt;
      &lt;td&gt;99.8&lt;/td&gt;
      &lt;td&gt;98.9&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Simple GA&lt;/td&gt;
      &lt;td&gt;82.1&lt;/td&gt;
      &lt;td&gt;82.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CMA-ES&lt;/td&gt;
      &lt;td&gt;98.4&lt;/td&gt;
      &lt;td&gt;98.1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;OpenAI-ES&lt;/td&gt;
      &lt;td&gt;96.0&lt;/td&gt;
      &lt;td&gt;96.2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;PEPG&lt;/td&gt;
      &lt;td&gt;98.5&lt;/td&gt;
      &lt;td&gt;98.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171031/mnist_results.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;We should take these results with a grain of salt, since they are based on a single run, rather than the average of 5-10 runs. The single-run results seem to indicate that CMA-ES is the best at the MNIST task, with the PEPG algorithm not far off. Both of these algorithms achieved ~ 98% test accuracy, 1% lower than the SGD/Adam baseline. Perhaps the ability to dynamically alter its covariance matrix and standard deviation parameters at each generation allowed CMA-ES to fine-tune its weights better than OpenAI’s simpler variation.&lt;/p&gt;

&lt;h2 id=&quot;try-it-yourself&quot;&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;There are probably open source implementations of all of the algorithms described in this article. The author of CMA-ES, Nikolaus Hansen, has been maintaining a numpy-based implementation of &lt;a href=&quot;https://github.com/CMA-ES/pycma&quot;&gt;CMA-ES&lt;/a&gt; with lots of bells and whistles. His python implementation introduced me to the training loop interface described earlier. Since this interface is quite easy to use, I also implemented the other algorithms, such as the Simple Genetic Algorithm, PEPG, and OpenAI’s ES, using the same interface, put them in a small python file called &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt;, and wrapped the original CMA-ES library in this small library as well. This way, I can quickly compare different ES algorithms by changing just one line:&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;code&gt;import es&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;#solver = es.SimpleGA(...)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;#solver = es.PEPG(...)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;#solver = es.OpenES(...)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;solver = es.CMAES(...)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;while True:&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;solutions = solver.ask()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;fitness_list = np.zeros(solver.popsize)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;for i in range(solver.popsize):&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;fitness_list[i] = evaluate(solutions[i])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;solver.tell(fitness_list)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;result = solver.result()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;if result[1] &amp;gt; MY_REQUIRED_FITNESS:&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;break&lt;/code&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;You can look at &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt; on &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/es.py&quot;&gt;GitHub&lt;/a&gt; and the IPython notebook &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/simple_es_example.ipynb&quot;&gt;examples&lt;/a&gt; using the various ES algorithms.&lt;/p&gt;

&lt;p&gt;In this &lt;a href=&quot;https://github.com/hardmaru/estool/blob/master/simple_es_example.ipynb&quot;&gt;IPython notebook&lt;/a&gt; that accompanies &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt;, I show how to use the ES solvers in &lt;code class=&quot;highlighter-rouge&quot;&gt;es.py&lt;/code&gt; to solve a 100-dimensional version of the Rastrigin function with even more local optima. The 100-D version is somewhat more challenging than the trivial 2D version used to produce the visualizations in this article. Below is a comparison of the performance of the various algorithms discussed:&lt;/p&gt;

&lt;center&gt;
&lt;img src=&quot;/assets/20171031/rastrigin10d.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;

&lt;p&gt;On this 100-D Rastrigin problem, none of the optimisers reached the global optimum, although CMA-ES came close and blew everything else away. PEPG came in 2nd place, while OpenAI-ES and the simple genetic algorithm fell behind. I had to use an annealing schedule to gradually lower &lt;script type=&quot;math/tex&quot;&gt;\sigma&lt;/script&gt; for OpenAI-ES to make it perform better on this task.&lt;/p&gt;
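&lt;p&gt;For reference, a shifted Rastrigin function consistent with the figure caption below, which places the global optimum at 10 in every dimension, can be written as follows. This is an illustrative sketch; the sign is flipped so that the ES solvers, which maximise fitness, see an optimum of zero:&lt;/p&gt;

```python
import numpy as np

def rastrigin_fitness(x, shift=10.0):
    # Shifted N-dimensional Rastrigin, negated so that ES maximises it.
    # The global optimum sits at x_i = shift for every i, with fitness 0;
    # every other point has strictly negative fitness.
    z = np.asarray(x) - shift
    return -(10.0 * z.size + np.sum(z ** 2 - 10.0 * np.cos(2.0 * np.pi * z)))
```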
&lt;p&gt;&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/assets/20171031/rastrigin_cma_solution.png&quot; width=&quot;60%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Final solution that CMA-ES discovered for the 100-D Rastrigin function.&lt;br /&gt;The global optimum is a 100-dimensional vector where every element is exactly 10.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;

&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;so proud of my little dude ... &lt;a href=&quot;https://t.co/j5j61vQxP0&quot;&gt;pic.twitter.com/j5j61vQxP0&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/889265345172615168?ref_src=twsrc%5Etfw&quot;&gt;July 23, 2017&lt;/a&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt; 
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;/2017/11/12/evolving-stable-strategies/&quot;&gt;next article&lt;/a&gt;, I will look at applying ES to other experiments and more interesting problems. Please &lt;a href=&quot;/2017/11/12/evolving-stable-strategies/&quot;&gt;check&lt;/a&gt; it out!&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{ha2017visual,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;A Visual Guide to Evolution Strategies&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author&amp;nbsp;&amp;nbsp;= &quot;Ha, David&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal&amp;nbsp;= &quot;blog.otoro.net&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;2017&quot;,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;url&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;= &quot;https://blog.otoro.net/2017/10/29/visual-evolution-strategies/&quot;&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;

&lt;h2 id=&quot;references-and-other-links&quot;&gt;References and Other Links&lt;/h2&gt;

&lt;p&gt;Below are a few links to information related to evolutionary computing which I found useful or inspiring.&lt;/p&gt;

&lt;p&gt;Image Credits of &lt;a href=&quot;https://www.reddit.com/r/CryptoMarkets/comments/6qpla3/investing_in_icos_results_may_vary/&quot;&gt;Lemmings Jumping off a Cliff&lt;/a&gt;. Your results may vary when investing in ICOs.&lt;/p&gt;

&lt;p&gt;CMA-ES: &lt;a href=&quot;https://github.com/CMA-ES&quot;&gt;Official Reference Implementation&lt;/a&gt; on GitHub, &lt;a href=&quot;https://arxiv.org/abs/1604.00772&quot;&gt;Tutorial&lt;/a&gt;, Original CMA-ES &lt;a href=&quot;http://www.cmap.polytechnique.fr/~nikolaus.hansen/cmaartic.pdf&quot;&gt;Paper&lt;/a&gt; from 2001, Overview &lt;a href=&quot;https://www.slideshare.net/OsamaSalaheldin2/cmaes-presentation&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf&quot;&gt;Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning&lt;/a&gt; (REINFORCE), 1992.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A64D1AE8313A364B814998E9E245B40A?doi=10.1.1.180.7104&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Parameter-Exploring Policy Gradients&lt;/a&gt;, 2009.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.jmlr.org/papers/volume15/wierstra14a/wierstra14a.pdf&quot;&gt;Natural Evolution Strategies&lt;/a&gt;, 2014.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.openai.com/evolution-strategies/&quot;&gt;Evolution Strategies as a Scalable Alternative to Reinforcement Learning&lt;/a&gt;, OpenAI, 2017.&lt;/p&gt;

&lt;p&gt;Risto Miikkulainen’s &lt;a href=&quot;http://nn.cs.utexas.edu/downloads/slides/miikkulainen.ijcnn13.pdf&quot;&gt;Slides&lt;/a&gt; on Neuroevolution.&lt;/p&gt;

&lt;p&gt;A Neuroevolution Approach to &lt;a href=&quot;http://www.cs.utexas.edu/~ai-lab/?atari&quot;&gt;General Atari Game Playing&lt;/a&gt;, 2013.&lt;/p&gt;

&lt;p&gt;Kenneth Stanley’s Talk on &lt;a href=&quot;https://youtu.be/dXQPL9GooyI&quot;&gt;Why Greatness Cannot Be Planned: The Myth of the Objective&lt;/a&gt;, 2015.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/ideas/neuroevolution-a-different-kind-of-deep-learning&quot;&gt;Neuroevolution&lt;/a&gt;: A Different Kind of Deep Learning. The quest to evolve neural networks through evolutionary algorithms.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://people.idsia.ch/~juergen/compressednetworksearch.html&quot;&gt;Compressed Network Search&lt;/a&gt; Finds Complex Neural Controllers with a Million Weights.&lt;/p&gt;

&lt;p&gt;Karl Sims &lt;a href=&quot;https://youtu.be/JBgG_VSP7f8&quot;&gt;Evolved Virtual Creatures&lt;/a&gt;, 1994.&lt;/p&gt;

&lt;p&gt;Evolved &lt;a href=&quot;https://youtu.be/euFvRfQRbLI&quot;&gt;Step Climbing&lt;/a&gt; Creatures.&lt;/p&gt;

&lt;p&gt;Super Mario World Agent &lt;a href=&quot;https://youtu.be/qv6UVOQ0F44&quot;&gt;Mario I/O&lt;/a&gt;, Mario Kart 64 &lt;a href=&quot;https://github.com/nicknlsn/MarioKart64NEAT&quot;&gt;Controller&lt;/a&gt; using the &lt;a href=&quot;https://www.cs.ucf.edu/~kstanley/neat.html&quot;&gt;NEAT Algorithm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.bionik.tu-berlin.de/institut/xstart.htm&quot;&gt;Ingo Rechenberg&lt;/a&gt;, the inventor of Evolution Strategies.&lt;/p&gt;

&lt;p&gt;A Tutorial on &lt;a href=&quot;https://pablormier.github.io/2017/09/05/a-tutorial-on-differential-evolution-with-python/&quot;&gt;Differential Evolution&lt;/a&gt; with Python.&lt;/p&gt;

&lt;h3 id=&quot;my-previous-evolutionary-projects&quot;&gt;My Previous &lt;em&gt;Evolutionary&lt;/em&gt; Projects&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://deepmind.com/research/publications/pathnet-evolution-channels-gradient-descent-super-neural-networks/&quot;&gt;PathNet&lt;/a&gt;: Evolution Channels Gradient Descent in Super Neural Networks&lt;/p&gt;

&lt;p&gt;Neural Network Evolution Playground with &lt;a href=&quot;https://blog.otoro.net/2016/05/07/backprop-neat/&quot;&gt;Backprop NEAT&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evolved Neural &lt;a href=&quot;https://otoro.net/gallery&quot;&gt;Art Gallery&lt;/a&gt; using &lt;a href=&quot;https://otoro.net/neurogram/&quot;&gt;CPPN&lt;/a&gt; Implementation&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://otoro.net/planks/&quot;&gt;Creatures Avoiding Planks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://otoro.net/nabi/slimevolley/index.html&quot;&gt;Neural Slime Volleyball&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evolution of &lt;a href=&quot;https://otoro.net/ml/pendulum-esp/index.html&quot;&gt;Inverted Double Pendulum&lt;/a&gt; Swing Up Controller&lt;/p&gt;
</description>
        <pubDate>Sun, 29 Oct 2017 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Teaching Machines to Draw</title>
        <link>https://blog.otoro.net/2017/05/19/teaching-machines-to-draw/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/05/19/teaching-machines-to-draw/</guid>
        <description>&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/sketch_rnn_examples.svg&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;

Latent space interpolation of various vector drawings produced by &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;.&lt;br /&gt;
&lt;code&gt;
&lt;a href=&quot;https://github.com/tensorflow/magenta/blob/master/magenta/models/sketch_rnn/README.md&quot;&gt;GitHub&lt;/a&gt;
&lt;/code&gt;

&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;This is an updated version of my article, cross-posted on the Google Research &lt;a href=&quot;https://research.googleblog.com/2017/04/teaching-machines-to-draw.html&quot;&gt;Blog&lt;/a&gt;.  Instructions on using the &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; model are available at the Google Brain &lt;a href=&quot;https://magenta.tensorflow.org/sketch_rnn&quot;&gt;Magenta Project&lt;/a&gt;.  Link to our paper, “&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;”.  This article has also been translated to &lt;a href=&quot;https://www.jqr.com/news/009523&quot;&gt;Simplified Chinese&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/frog_crab_cat.png&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Vector drawings produced by our model.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;Recently, there have been major advancements in generative modelling of images using neural networks as a generative tool. While there is already a &lt;a href=&quot;https://github.com/carpedm20/BEGAN-tensorflow/blob/master/README.md&quot;&gt;large&lt;/a&gt; &lt;a href=&quot;https://affinelayer.com/pixsrv/&quot;&gt;body&lt;/a&gt; &lt;a href=&quot;https://github.com/carpedm20/DCGAN-tensorflow&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://github.com/skaae/vaeblog&quot;&gt;existing&lt;/a&gt; &lt;a href=&quot;https://github.com/junyanz/CycleGAN/blob/master/README.md&quot;&gt;work&lt;/a&gt; on generative modelling of images using neural networks, most of the work thus far has targeted low-resolution, pixel images.&lt;/p&gt;

&lt;p&gt;Humans, however, do not understand the world as a grid of pixels, but rather develop abstract concepts to represent what we see. From a young age, we develop the ability to communicate what we see by drawing on a piece of paper with a pencil. In this way we learn to express a sequential, &lt;em&gt;vector&lt;/em&gt; representation of an image as a short sequence of strokes.  In this work, we investigate an alternative to traditional pixel image modelling approaches, and propose a generative model for vector images.&lt;/p&gt;

&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Humans learn to draw sequentially. Designers rely on vector graphics. Yet most ML Research focus only on generative models for pixel images. &lt;a href=&quot;https://t.co/3VHe3HmFCi&quot;&gt;pic.twitter.com/3VHe3HmFCi&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/866055378005401600&quot;&gt;May 20, 2017&lt;/a&gt;&lt;/blockquote&gt; &lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;i&gt;Children learn to draw &lt;a href=&quot;https://en.wikipedia.org/wiki/Doraemon_(character)&quot;&gt;Doraemon&lt;/a&gt; as a sequential set of strokes.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;Children develop the ability to depict objects, and arguably even emotions, with only a few pen strokes. They learn to draw their favourite anime characters, family, friends and familiar places. These simple drawings may not resemble reality as captured by a photograph, but they do tell us something about how people represent and reconstruct images of the world around them.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;i&gt;“The function of vision is to update the internal model of the world inside our head, but what we put on a piece of paper is the internal model.”&lt;/i&gt;&lt;/b&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
 &amp;mdash; Harold Cohen, &lt;a href=&quot;https://youtu.be/Xlhd8iP1hXo?t=20m&quot;&gt;Reflections on Design and Building AARON&lt;/a&gt;.
&lt;p&gt;
&lt;/p&gt;
&lt;/center&gt;
&lt;hr /&gt;

&lt;p&gt;In our paper, “&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;”, we present a generative recurrent neural network capable of producing sketches of common objects, with the goal of training a machine to draw and generalize abstract concepts in a manner similar to humans. We train our model on a &lt;a href=&quot;https://quickdraw.withgoogle.com/data&quot;&gt;dataset&lt;/a&gt; of hand-drawn sketches, each represented as a sequence of motor actions controlling a pen: which direction to move, when to lift the pen up, and when to stop drawing. In doing so, we created a model that potentially has many applications, from assisting the creative process of an artist, to helping teach students how to draw.&lt;/p&gt;

&lt;p&gt;In this work, we model a vector-based representation of images inspired by how people draw. We use recurrent neural networks as our generative model. Not only can our recurrent neural network generate individual vector drawings by constructing a sequence of strokes, like these previous experiments on Generative &lt;a href=&quot;https://blog.otoro.net/2015/12/12/handwriting-generation-demo-in-tensorflow/&quot;&gt;Handwriting&lt;/a&gt; and Generative &lt;a href=&quot;https://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow/&quot;&gt;Kanji&lt;/a&gt;, our model can also generate a vector drawing conditional on a &lt;em&gt;latent vector&lt;/em&gt;, &lt;script type=&quot;math/tex&quot;&gt;z&lt;/script&gt;, as an input into the model.&lt;/p&gt;

&lt;p&gt;Similar to a previous &lt;a href=&quot;https://blog.otoro.net/2016/04/01/generating-large-images-from-latent-vectors/&quot;&gt;work&lt;/a&gt; where we interpolate between multiple latent vectors to generate animated high-resolution morphing MNIST animations, we can train our model on hand-drawn sketches from the &lt;em&gt;yoga&lt;/em&gt; category of the &lt;a href=&quot;https://quickdraw.withgoogle.com/data/yoga&quot;&gt;QuickDraw&lt;/a&gt; dataset, and have it dream up yoga positions in both time and space directions.&lt;/p&gt;

&lt;center&gt;
&lt;blockquote class=&quot;twitter-video&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;An RNN&amp;#39;s Understanding of Yoga. &lt;a href=&quot;https://t.co/0E4AJ3B49X&quot;&gt;pic.twitter.com/0E4AJ3B49X&lt;/a&gt;&lt;/p&gt;&amp;mdash; hardmaru (@hardmaru) &lt;a href=&quot;https://twitter.com/hardmaru/status/852943471866281985&quot;&gt;April 14, 2017&lt;/a&gt;&lt;/blockquote&gt;&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
&lt;i&gt;“Generating sequential data is the closest computers get to dreaming.”&lt;br /&gt;&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;h2 id=&quot;a-generative-model-for-vector-drawings&quot;&gt;A Generative Model for Vector Drawings&lt;/h2&gt;

&lt;p&gt;Our model, &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;, is based on the &lt;a href=&quot;https://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/&quot;&gt;sequence-to-sequence&lt;/a&gt; (seq2seq) autoencoder framework. It incorporates &lt;a href=&quot;https://jmetzen.github.io/2015-11-27/vae.html&quot;&gt;variational inference&lt;/a&gt; and utilizes &lt;a href=&quot;https://blog.otoro.net/2016/09/28/hyper-networks/&quot;&gt;Hyper Networks&lt;/a&gt; as recurrent neural network cells. The goal of a seq2seq autoencoder is to train a network to encode an input sequence into a vector of floating point numbers, called a latent vector, and from this latent vector reconstruct an output sequence using a decoder that replicates the input sequence as closely as possible.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/sketch_rnn_schematic.svg&quot; width=&quot;100%&quot; /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Schematic of &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In our model, we deliberately add noise to the latent vector. In our paper, we show that by inducing noise into the communication channel between the encoder and the decoder, the model is no longer able to reproduce the input sketch exactly, but instead must learn to capture the essence of the sketch as a noisy latent vector. Our decoder takes this latent vector and produces a sequence of motor actions used to construct a new sketch. In the figure below, we feed several actual sketches of cats into the encoder to produce reconstructed sketches using the decoder.&lt;/p&gt;
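&lt;p&gt;The noise injection described above follows the usual variational-autoencoder recipe: the encoder outputs a mean and a (log) standard deviation for each latent dimension, and the latent vector is sampled as &lt;script type=&quot;math/tex&quot;&gt;z = \mu + \sigma \epsilon&lt;/script&gt;. Below is a minimal numpy sketch, with hypothetical encoder outputs standing in for the real model:&lt;/p&gt;

```python
import numpy as np

def sample_latent(mu, log_sigma, rng):
    # Reparameterisation trick: draw z = mu + sigma * eps, keeping the
    # randomness outside the learned encoder/decoder mapping.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for one sketch, using a 128-D latent space.
mu, log_sigma = np.zeros(128), np.full(128, -1.0)
z = sample_latent(mu, log_sigma, rng)  # noisy latent fed to the decoder
```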

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_cats.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Reconstructions from a model trained on cat sketches sampled at varying temperature levels.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;It is important to emphasize that the reconstructed cat sketches are not copies of the input sketches, but are instead new sketches of cats with similar characteristics as the inputs. To demonstrate that the model is not simply copying from the input sequence, and that it actually learned something about the way people draw cats, we can try to feed in non-standard sketches into the encoder.  When we feed in a sketch of a three-eyed cat, the model generates a similar looking cat that has two eyes instead, suggesting that our model has learned that cats usually only have two eyes.&lt;/p&gt;

&lt;p&gt;To show that our model is not simply choosing the closest normal-looking cat from a large collection of memorized cat-sketches, we can try to input something totally different, like a sketch of a toothbrush. We see that the network generates a cat-like figure with long whiskers that mimics the features and orientation of the toothbrush. This suggests that the network has learned to encode an input sketch into a set of abstract cat-concepts embedded into the latent vector, and is also able to reconstruct an entirely new sketch based on this latent vector.&lt;/p&gt;

&lt;p&gt;Not convinced? We repeat the experiment again on a model trained on pig sketches and arrive at similar conclusions. When presented with an eight-legged pig, the model generates a similar pig with only four legs. If we feed a truck into the pig-drawing model, we get a pig that looks a bit like the truck.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_pigs.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Reconstructions from a model trained on pig sketches sampled at varying temperature levels.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;To investigate how these latent vectors encode conceptual animal features, in the figure below, we first obtain two latent vectors encoded from two very different pigs, in this case a pig head (in the green box) and a full pig (in the orange box). We want to get a sense of how our model learned to represent pigs, and one way to do this is to interpolate between the two latent vectors and visualize the sketch generated from each interpolated latent vector. In the figure below, we visualize how the sketch of the pig head slowly morphs into the sketch of the full pig, and in the process show how the model organizes the concepts of pig sketches. We see that the latent vector controls the relative position and size of the nose with respect to the head, as well as the existence of the body and legs in the sketch.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/pig_morph.png&quot; width=&quot;65%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Latent space interpolations generated from a model trained on pig sketches.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We would also like to know if our model can learn representations of multiple animals, and if so, what would they look like? In the figure below, we generate sketches from interpolating latent vectors between a cat head and a full pig. We see how the representation slowly transitions from a cat head, to a cat with a tail, to a cat with a fat body, and finally into a full pig. Like a child learning to draw animals, our model learns to construct animals by attaching a head, feet, and a tail to its body. We see that the model is also able to draw cat heads that are distinct from pig heads.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_morphs.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Latent Space Interpolations from a model trained on sketches of both cats and pigs.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;These interpolation examples suggest that the latent vectors indeed encode conceptual features of a sketch. But can we use these features to augment other sketches without such features - for example, adding a body to a cat’s head?&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/vae_analogy.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Learned relationships between abstract concepts, explored using latent vector arithmetic.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;Indeed, we find that sketch drawing analogies are possible for our model trained on both cat and pig sketches. For example, we can subtract the latent vector of an encoded pig head from the latent vector of a full pig, to arrive at a vector that represents the concept of a body. Adding this difference to the latent vector of a cat head results in a full cat (i.e. cat head + body = full cat). These drawing analogies allow us to explore how the model organizes its latent space to represent different concepts in the manifold of generated sketches.&lt;/p&gt;
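&lt;p&gt;A back-of-the-envelope sketch of this latent vector arithmetic, using made-up toy vectors rather than actual model encodings:&lt;/p&gt;

```javascript
// Element-wise subtract and add over plain float arrays.
function vecSub(a, b) { return a.map(function(v, i) { return v - b[i]; }); }
function vecAdd(a, b) { return a.map(function(v, i) { return v + b[i]; }); }

// Toy latent vectors; a real model would obtain these from its encoder.
var zFullPig = [1.0, 0.75, 0.25];
var zPigHead = [1.0, 0.25, 0.25];
var zCatHead = [-0.5, 0.25, 0.5];

// Subtracting a pig head from a full pig isolates a "body" concept,
var zBody = vecSub(zFullPig, zPigHead);

// and adding that concept to a cat head should decode to a full cat.
var zFullCat = vecAdd(zCatHead, zBody);
```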

&lt;h2 id=&quot;creative-applications&quot;&gt;Creative Applications&lt;/h2&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/multiple_interpolations.png&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Exploring the latent space of generated sketches of everyday objects.&lt;br /&gt;
Latent space interpolation from left to right, and then top to bottom.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;In addition to the research component of this work, we are also super excited about potential creative applications of &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt;. For instance, even in the simplest use case, pattern designers can apply &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; to generate a large number of similar, but unique designs for textile or wallpaper prints.&lt;/p&gt;

&lt;p&gt;As we saw earlier, a model trained to draw pigs can be made to draw pig-like trucks if given an input sketch of a truck. We can extend this result to applications that might help creative designers come up with abstract designs that can resonate more with their target audience.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/cat_vae.png&quot; width=&quot;47%&quot; /&gt;&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/morph_catchairs.svg&quot; width=&quot;53%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;Similar, but unique cats, generated from a single input sketch in the green box (left).&lt;br /&gt;
Exploring the latent space of generated chair-cats (right).
&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;For instance, in the figure above, we feed sketches of four different chairs into our cat-drawing model to produce four chair-like cats. We can go further and incorporate the interpolation methodology described earlier to explore the latent space of chair-like cats, and produce a large grid of generated designs to select from.&lt;/p&gt;

&lt;p&gt;Exploring the latent space between different objects can potentially enable creative designers to find interesting intersections and relationships between different drawings:&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/catbus.svg&quot; width=&quot;80%&quot; /&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/catbus2.svg&quot; width=&quot;80%&quot; /&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/elephantpig.svg&quot; width=&quot;80%&quot; /&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/owlmorph.svg&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
&lt;i&gt;Exploring the latent space between cats and buses, elephants and pigs, and various owls.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We can also use the decoder module of &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; as a standalone model and train it to predict different possible endings of incomplete sketches. This technique can lead to applications where the model assists the creative process of an artist by suggesting alternative ways to finish an incomplete sketch. In the figure below, we draw different incomplete sketches (in red), and have the model come up with different possible ways to complete the drawings.&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/sketch-rnn/master/example/full_predictions.svg&quot; width=&quot;100%&quot; /&gt;&lt;br /&gt;
&lt;p&gt;&lt;/p&gt;
&lt;i&gt;The model can start with incomplete sketches and automatically generate different completions.&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We believe the best creative works will not be created by machines alone, but by designers who use machine learning as a tool to enrich their creative thinking process. In the future, we envision these tools being used collaboratively by artists and designers. Below is a simple conceptual example illustrating this collaboration using our model:&lt;/p&gt;

&lt;p&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;https://otoro.net/img/make_it_rain.gif&quot; /&gt;&lt;br /&gt;
&lt;i&gt;“Making it rain with recurrent neural nets.”&lt;/i&gt;
&lt;/center&gt;
&lt;p&gt;
&lt;/p&gt;

&lt;p&gt;We are very excited about the future possibilities of generative vector image modelling. These models will enable many exciting new creative applications in a variety of different directions. They can also serve as a tool to help us improve our understanding of our own creative thought processes. Learn more about &lt;code class=&quot;highlighter-rouge&quot;&gt;sketch-rnn&lt;/code&gt; by reading our paper, “&lt;a href=&quot;https://arxiv.org/abs/1704.03477&quot;&gt;A Neural Representation of Sketch Drawings&lt;/a&gt;”.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you find this work useful, please cite it as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;
@article{ha2017neural,&lt;br /&gt;
&amp;nbsp;&amp;nbsp;title={A neural representation of sketch drawings},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;author={Ha, David and Eck, Douglas},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;journal={arXiv preprint arXiv:1704.03477},&lt;br /&gt;
&amp;nbsp;&amp;nbsp;year={2017}&lt;br /&gt;
}
&lt;/code&gt;&lt;/p&gt;

</description>
        <pubDate>Fri, 19 May 2017 00:00:00 -0500</pubDate>
      </item>
    
      <item>
        <title>Recurrent Neural Network Tutorial for Artists</title>
        <link>https://blog.otoro.net/2017/01/01/recurrent-neural-network-artist/</link>
        <guid isPermaLink="true">https://blog.otoro.net/2017/01/01/recurrent-neural-network-artist/</guid>
        <description>&lt;center&gt;
&lt;img src=&quot;https://cdn.rawgit.com/hardmaru/rnn-tutorial/master/neural.svg&quot; width=&quot;100%&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;This post is not meant to be a comprehensive overview of recurrent neural networks.  It is intended for readers without any machine learning background.  The goal is to show artists and designers how to use a pre-trained neural network to produce interactive digital works using simple Javascript and the &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; library.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;div id=&quot;sketch01&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;center&gt;&lt;i&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://otoro.net/ml/rnn-tutorial&quot;&gt;Handwriting Generation with Javascript&lt;/a&gt;&lt;/i&gt;&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;Machine learning has become a popular tool for the creative community in recent years. Techniques such as &lt;a href=&quot;https://github.com/lengstrom/fast-style-transfer&quot;&gt;style transfer&lt;/a&gt;, &lt;a href=&quot;https://aiexperiments.withgoogle.com/drum-machine&quot;&gt;t-sne&lt;/a&gt;, &lt;a href=&quot;https://gabgoh.github.io/ThoughtVectors/&quot;&gt;autoencoders&lt;/a&gt;, &lt;a href=&quot;https://opendot.github.io/ml4a-invisible-cities/&quot;&gt;generative adversarial networks&lt;/a&gt;, and &lt;a href=&quot;http://www.evolvingai.org/InnovationEngine&quot;&gt;countless&lt;/a&gt; other methods have made their way into the digital artist’s toolbox. Many &lt;a href=&quot;http://golancourses.net/2016/lectures/3-15/3-15-machine-learning/&quot;&gt;techniques&lt;/a&gt; take advantage of &lt;a href=&quot;https://medium.com/@kcimc/a-return-to-machine-learning-2de3728558eb&quot;&gt;convolutional neural networks&lt;/a&gt; for feature extraction and feature processing.&lt;/p&gt;

&lt;p&gt;On the other end of the spectrum, recurrent neural networks and other &lt;a href=&quot;https://www.asimovinstitute.org/analyzing-deep-learning-tools-music/&quot;&gt;autoregressive models&lt;/a&gt; enable powerful tools that can generate realistic sequential data.  Artists have employed such techniques to &lt;a href=&quot;https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0#.jikh6uesi&quot;&gt;generate&lt;/a&gt; &lt;a href=&quot;https://www.robinsloan.com/notes/writing-with-the-machine/&quot;&gt;text&lt;/a&gt;, &lt;a href=&quot;https://www.technologyreview.com/s/603003/ai-songsmith-cranks-out-surprisingly-catchy-tunes/&quot;&gt;music&lt;/a&gt;, and &lt;a href=&quot;https://github.com/ibab/tensorflow-wavenet&quot;&gt;sounds&lt;/a&gt;.  One area that I feel lacks focus at the moment is the generation of vector artwork, perhaps due to the lack of available data.&lt;/p&gt;

&lt;p&gt;Handwriting is a form of sketch artwork.  Recently, I have collaborated with &lt;a href=&quot;https://twitter.com/shancarter&quot;&gt;Shan Carter&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/enjalot&quot;&gt;Ian Johnson&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/ch402&quot;&gt;Chris Olah&lt;/a&gt; to publish a &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;post&lt;/a&gt; on &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; on handwriting generation.  In particular, the experiments in the post help visualise the internals of a recurrent neural network trained to generate handwriting.  The truth is, that project also served as a kind of meta-experiment for myself.  Rather than directly working on the visualisation experiments and writeup, I set out to create a pre-trained handwriting model with an easy-to-use Javascript interface, and have my collaborators, who are highly talented data visualisation artists, experiment with the model to create something out of it.  They ended up creating the beautiful interactive visualization experiments in the &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; post.&lt;/p&gt;

&lt;!--The experiment made me appreciate the power of abstraction, in the context of complex systems.  Architects design beautiful buildings without the need to understand low-level physical properties of glass, steel, or concrete.  The materials are engineered with high-level specification of safety limits, and beautiful structures can be built from these materials as long as they operate within a safety framework.  Although there are architects who are also capable engineers who understand both professions, I feel the need to constantly think of low level details will limit the creative process during design.  By hiding complexity, these abstraction layers help designers to focus on creative output.--&gt;

&lt;p&gt;I decided to write this post and make available the same handwriting model used in the &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; project along with explanations, with the hope that other artists and designers can also take advantage of these technologies and even go deeper into the field.&lt;/p&gt;

&lt;h2 id=&quot;modelling-a-handwriting-brain&quot;&gt;Modelling a Handwriting Brain&lt;/h2&gt;

&lt;p&gt;There are many things going on in our brain when we are writing a letter.  Based on what we set out to accomplish by writing, we make a plan about what we are going to write, select a suitable choice of vocabulary, decide how neat our handwriting needs to be, and then pick up the pen and start writing something on a pad of paper, making decisions about where to place the pen, where to move it, and when to pick it up.&lt;/p&gt;

&lt;p&gt;It would be difficult to create a Javascript model to simulate the entire human brain for writing a letter, but we can instead try to &lt;em&gt;model&lt;/em&gt; the handwriting brain approximately by focusing on the last part of the handwriting process, namely where to place the pen, where to move it, and when to pick it up.  So our model of the handwriting process will only care about the location of the pen, and whether the pen is touching the paper pad.&lt;/p&gt;

&lt;p&gt;We also make two assumptions about the model.  The first assumption is that the decision of what the model will write next will only depend on whatever it wrote in the past.  When we write things, we remember precisely the details of the last pen stroke, but we don’t actually remember exactly what we wrote many strokes ago, and only have a vague idea about what was written.  This &lt;em&gt;vague idea&lt;/em&gt; about what was written before can in fact be modelled within the context of a &lt;em&gt;recurrent neural network&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With an RNN, we can store this type of vague knowledge directly into the neurons of the RNN, and we refer to this object as the &lt;em&gt;hidden state&lt;/em&gt; of the RNN.  This hidden state is just a vector of floating point numbers that keeps track of how active each neuron is.  What our model will write next will therefore depend on its hidden state.  This hidden state object will keep on getting updated after something is written, so it will be constantly changing.  We will demonstrate how this works in the next section.&lt;/p&gt;
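&lt;p&gt;To make the idea of a hidden state concrete, here is a drastically simplified toy recurrence in Javascript.  This is not the actual pre-trained model (which uses a far richer architecture with full weight matrices), just an illustration of how a hidden state gets updated after each stroke:&lt;/p&gt;

```javascript
// Toy recurrent update: the new hidden state is a squashed blend of the
// current input and the previous hidden state.  wIn and wRec are single
// shared weights here; a real RNN uses full weight matrices.
function rnnStep(input, hidden, wIn, wRec) {
  return hidden.map(function(h, i) {
    return Math.tanh(wIn * input[i] + wRec * h);
  });
}

var hidden = [0, 0, 0];                 // hidden state starts out empty
var strokes = [[1, 0, 0], [0, 1, 0]];   // toy input sequence
for (var t = 0; t !== strokes.length; t++) {
  hidden = rnnStep(strokes[t], hidden, 0.5, 0.9);
}
// hidden now carries a (vague) summary of everything written so far
```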

&lt;p&gt;The second assumption about the model is that the model will not be &lt;em&gt;absolutely certain&lt;/em&gt; about what it should write next.  In fact, the decision of what the model will write next is &lt;em&gt;random&lt;/em&gt;.  For example, when the model is writing the character &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt;, it may decide to either continue writing the character to make the bottom hook of the &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt; character larger, or it can decide to suddenly finish off the character and move the pen to another location.  Therefore, the output of our model will not be precisely what to write next, but actually a &lt;em&gt;probability distribution&lt;/em&gt; of what to write next.  We will need to sample from this probability distribution to decide what to actually write next.&lt;/p&gt;
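&lt;p&gt;As a rough illustration of what sampling from such a distribution looks like, here is a toy example over a simple discrete distribution.  The actual model outputs a mixture-density distribution rather than a discrete one, but the role of the temperature setting used later in this post is analogous: low temperature sharpens the distribution, high temperature flattens it.&lt;/p&gt;

```javascript
// Flatten or sharpen a discrete distribution with a temperature value,
// then renormalise so the probabilities sum to one.
function adjustTemperature(probs, temperature) {
  var scaled = probs.map(function(p) { return Math.pow(p, 1 / temperature); });
  var total = scaled.reduce(function(a, b) { return a + b; }, 0);
  return scaled.map(function(p) { return p / total; });
}

// Pick a random index, where index i is chosen with probability probs[i].
function sampleIndex(probs) {
  var r = Math.random();
  for (var i = 0; i !== probs.length; i++) {
    r -= probs[i];
    if (Math.sign(r) === -1) { return i; }  // r fell inside this slot
  }
  return probs.length - 1;  // guard against floating-point round-off
}

var pdf = adjustTemperature([0.5, 0.25, 0.25], 0.65);
var choice = sampleIndex(pdf);  // 0, 1, or 2 (most often 0)
```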

&lt;p&gt;These two assumptions can be summarised in the following diagram, which describes the process of using a Recurrent Neural Network model with a hidden state to generate a random sequence.&lt;/p&gt;

&lt;center&gt;
  &lt;br /&gt;
  &lt;div id=&quot;state_diagram&quot; class=&quot;wp-caption center&quot;&gt;
    &lt;a target=&quot;_blank&quot; href=&quot;/wp-content/uploads/sites/2/2015/12/state_diagram.svg&quot;&gt;&lt;img src=&quot;/wp-content/uploads/sites/2/2015/12/state_diagram.svg&quot; alt=&quot;state_diagram&quot; class=&quot;alignnone size-medium wp-image-983&quot; /&gt;&lt;/a&gt;
    &lt;p class=&quot;wp-caption-text&quot;&gt;
      Generative Sequence Model Framework&lt;br /&gt;
    &lt;/p&gt;
  &lt;/div&gt;
  &lt;br /&gt;
&lt;/center&gt;

&lt;p&gt;Don’t worry if you don’t fully understand this diagram.  In the next section, we will demonstrate what is going on line-by-line with Javascript.&lt;/p&gt;

&lt;h2 id=&quot;recurrent-neural-network-for-handwriting&quot;&gt;Recurrent Neural Network for Handwriting&lt;/h2&gt;

&lt;p&gt;We have pre-trained a recurrent neural network &lt;a href=&quot;https://github.com/hardmaru/rnn-tutorial&quot;&gt;model&lt;/a&gt; to perform the handwriting task described in the previous section.  In this section, we will describe how to use this model in Javascript with &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt;.  Below is the entire &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; sketch for handwriting generation.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.65&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerWidth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerHeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;setup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;createCanvas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;frameRate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;get_pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;stroke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;strokeWeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;2.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We will explain how each line works.  First, we will need to define a few variables to keep track of where the pen actually is (&lt;code class=&quot;highlighter-rouge&quot;&gt;x, y&lt;/code&gt;).  Our model will work with small coordinate offsets (&lt;code class=&quot;highlighter-rouge&quot;&gt;dx, dy&lt;/code&gt;) to determine where the pen should go next, and &lt;code class=&quot;highlighter-rouge&quot;&gt;(x, y)&lt;/code&gt; will be the accumulation of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy)&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// absolute coordinates of where the pen is&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// offsets of the pen strokes, in pixels&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In addition, our pen will not always be touching the paper.  We would need a variable, called &lt;code class=&quot;highlighter-rouge&quot;&gt;pen&lt;/code&gt;, to model this.  If &lt;code class=&quot;highlighter-rouge&quot;&gt;pen&lt;/code&gt; is zero, then our pen is touching the paper at the current time step.  We also need to keep track of the &lt;code class=&quot;highlighter-rouge&quot;&gt;pen&lt;/code&gt; variable at the previous time step, and store this into &lt;code class=&quot;highlighter-rouge&quot;&gt;prev_pen&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// keep track of whether pen is touching paper. 0 or 1.&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// pen at the previous timestep&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If we have a list of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; variables generated by our model at every time step, it will be enough for us to use this data to draw out what the model has generated on the screen.  At the beginning, all of these variables (&lt;code class=&quot;highlighter-rouge&quot;&gt;dx, dy, x, y, pen, prev_pen&lt;/code&gt;) will be initialised to zero.&lt;/p&gt;
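&lt;p&gt;To see why this list is enough to redraw everything, here is a small helper (an illustrative sketch, not part of the model code) that converts a list of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; triples into absolute line segments, using the same rule as the p5.js sketch above: a segment is only drawn when the pen was touching the paper at the previous time step.&lt;/p&gt;

```javascript
// Convert a list of [dx, dy, pen] triples into absolute line segments.
// pen is 0 when the pen touches the paper and 1 when it is lifted.
function strokesToSegments(strokes, startX, startY) {
  var x = startX, y = startY;
  var prevPen = 0;           // pen starts on the paper
  var segments = [];
  for (var i = 0; i !== strokes.length; i++) {
    var dx = strokes[i][0], dy = strokes[i][1], pen = strokes[i][2];
    if (prevPen === 0) {
      // only draw when the pen was down at the previous time step
      segments.push([x, y, x + dx, y + dy]);
    }
    x += dx;                 // accumulate offsets into absolute coordinates
    y += dy;
    prevPen = pen;
  }
  return segments;
}

// The second triple lifts the pen, so the third move leaves no segment.
var segs = strokesToSegments([[10, 0, 0], [0, 10, 1], [10, 0, 0]], 0, 0);
```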

&lt;p&gt;We will also define some variable objects that will be used by our RNN model:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// store the hidden state of the rnn&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// store all the parameters of a mixture-density distribution&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// controls the amount of uncertainty of the model&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// the higher the temperature, the more uncertainty.&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.65&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// a non-negative number.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As described in the previous section, the &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt; variable will represent the &lt;em&gt;hidden state&lt;/em&gt; of the RNN.  This variable will hold all the vague ideas about what the RNN &lt;em&gt;thought&lt;/em&gt; it has written in the past.  To update &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt;, we will use the &lt;code class=&quot;highlighter-rouge&quot;&gt;update&lt;/code&gt; function in the model later on in the code.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The object &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt; will be used to generate the probability distribution of what the model will write next.  That probability distribution will be represented as the object called &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt;.  To obtain the &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; object from &lt;code class=&quot;highlighter-rouge&quot;&gt;rnn_state&lt;/code&gt;, we will use the &lt;code class=&quot;highlighter-rouge&quot;&gt;get_pdf&lt;/code&gt; function later, like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;get_pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;An additional variable called &lt;code class=&quot;highlighter-rouge&quot;&gt;temperature&lt;/code&gt; allows us to control how confident or how uncertain we want to make the model.  Combined with the &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; object, we can use the &lt;code class=&quot;highlighter-rouge&quot;&gt;sample&lt;/code&gt; function in the model to sample the next set of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; values from our probability distribution.  We will use the following function later on:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The only other variables we need control the colour of the handwriting, and keep track of the browser window’s dimensions:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// stores the browser's dimensions&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerWidth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;innerHeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// colour for the handwriting&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we are ready to initialise all these variables we just declared for the actual handwriting generation.  We will create a function called &lt;code class=&quot;highlighter-rouge&quot;&gt;restart&lt;/code&gt; to initialise these variables since we will be reinitialising them many times later.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// set x to be 50 pixels from the left of the canvas&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// set y somewhere in middle of the canvas&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// initialize pen's states to zero.&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// note: we draw lines based off previous pen's state&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// randomise the rnn's initial hidden states&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// randomise colour of line by choosing RGB values&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;After creating the &lt;code class=&quot;highlighter-rouge&quot;&gt;restart&lt;/code&gt; function, we can define the usual &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;setup&lt;/code&gt; function to initialise the sketch.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;setup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// initialize variables for this demo&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;createCanvas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;frameRate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// 60 frames per second&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// clear the background to be blank white colour&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Our handwriting generation will take place in the &lt;code class=&quot;highlighter-rouge&quot;&gt;draw&lt;/code&gt; function of the &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; framework.  This function is called 60 times per second.  Each time this function is called, the RNN will draw something on the screen.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot; data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// using the previous pen states, and hidden state&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// to get next hidden state &lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// get the parameters of the probability distribution&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// from the hidden state&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;get_pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;rnn_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// sample the next pen's states&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// using our probability distribution and temperature&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;pdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// only draw on the paper if pen is touching the paper&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// set colour of the line&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;stroke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;line_color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// set width of the line to 2 pixels&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;strokeWeight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;2.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// draw line connecting prev point to current point.&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// update the absolute coordinates from the offsets&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;dy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// update the previous pen's state&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// to the current one we just sampled&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;prev_pen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;pen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// if the rnn starts drawing close to the right side&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// of the screen, restart our demo&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;screen_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// reset screen&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;background&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;255&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;At each frame, the &lt;code class=&quot;highlighter-rouge&quot;&gt;draw&lt;/code&gt; function will update the hidden state of the model based on what it has previously drawn on the screen.  From this hidden state, the model will generate a probability distribution of what will be generated next.  Based on this probability distribution, along with the &lt;code class=&quot;highlighter-rouge&quot;&gt;temperature&lt;/code&gt; parameter, we will randomly sample what action it will take in the form of a new set of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; variables.  Based on this new set of variables, it will draw a line on the screen if the pen was previously touching the paper pad, and update the global location of the pen.  Once the global location of the pen gets close to the right side of the screen, it will reset the sketch and start again.&lt;/p&gt;
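&lt;p&gt;To see the bookkeeping in isolation, here is a small illustrative function (the name and setup are our own, not part of the demo code) that replays a recorded list of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; steps into absolute line segments, mirroring the logic of the &lt;code class=&quot;highlighter-rouge&quot;&gt;draw&lt;/code&gt; function without any p5.js calls:&lt;/p&gt;

```javascript
// Illustrative replay of recorded (dx, dy, pen) steps into absolute
// line segments. This mirrors the draw loop's bookkeeping, but the
// function name and example data are ours, not part of the demo code.
function strokes_to_lines(steps, start_x, start_y) {
  var x = start_x, y = start_y, prev_pen = 0;
  var lines = [];
  steps.forEach(function(step) {
    var dx = step[0], dy = step[1], pen = step[2];
    if (prev_pen === 0) {
      // the pen was touching the paper, so this offset leaves a visible line
      lines.push([x, y, x + dx, y + dy]);
    }
    // update the absolute coordinates and the previous pen state
    x += dx;
    y += dy;
    prev_pen = pen;
  });
  return lines;
}

// two strokes, then a pen lift, then a move that leaves no mark
var segments = strokes_to_lines([[5, 0, 0], [5, 2, 1], [10, -2, 0]], 50, 100);
```

&lt;p&gt;Note that a segment is recorded only when the &lt;em&gt;previous&lt;/em&gt; pen state was touching the paper, which is exactly the &lt;code class=&quot;highlighter-rouge&quot;&gt;prev_pen == 0&lt;/code&gt; check in the sketch above.&lt;/p&gt;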

&lt;p&gt;Putting all of this together, we get the following handwriting generation sketch.&lt;/p&gt;

&lt;div id=&quot;sketch02&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;So there you have it: handwriting generation in your web browser, with a few lines of Javascript using &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;sampling-from-a-probability-distribution-with-varying-temperature&quot;&gt;Sampling from a Probability Distribution with Varying Temperature&lt;/h3&gt;

&lt;p&gt;The variable &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; is supposed to store the probability distribution of the next pen stroke at each time step.  Under the hood, the object &lt;code class=&quot;highlighter-rouge&quot;&gt;pdf&lt;/code&gt; actually just contains the parameters of a complicated probability distribution (i.e. the mixture weights, means and standard deviations of a collection of Normal distributions).  We have chosen to model the probability distribution of &lt;code class=&quot;highlighter-rouge&quot;&gt;dx&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;dy&lt;/code&gt; as a &lt;em&gt;Mixture Density Distribution&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But what exactly is a mixture density distribution?  Well, statisticians (&lt;em&gt;data scientists&lt;/em&gt;) like to model probability distributions with well known, mathematically tractable distributions such as the Normal Distribution, and they try to determine the parameters of the distribution (such as the mean and standard deviation for a Normal Distribution) to best fit the data.  However, when dealing with something complicated, like the strokes of handwriting data, we find that a simple Normal Distribution is not good enough to model the data.  Intuitively, handwriting strokes either stay close to the previous location, or jump to another location when a word or character is finished.&lt;/p&gt;

&lt;p&gt;A straightforward way to deal with this problem is to model the probability distribution as a weighted sum of many Normal distributions.  In our case, we model the handwriting strokes as a mixture of 20 Normal distributions, which does a reasonable job of modelling the actual handwriting data.  More technical details are available in this earlier &lt;a href=&quot;https://blog.otoro.net/2015/12/12/handwriting-generation-demo-in-tensorflow/&quot;&gt;post&lt;/a&gt;.&lt;/p&gt;
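&lt;p&gt;To make the idea concrete, here is a minimal sketch of a one-dimensional mixture density.  The helper names here are illustrative, and are not part of the actual &lt;code class=&quot;highlighter-rouge&quot;&gt;Model&lt;/code&gt; object:&lt;/p&gt;

```javascript
// Illustrative sketch of a mixture density distribution, in one dimension.
// These helper names are our own and not part of the actual Model object.
function normal_pdf(x, mu, sigma) {
  // density of a Normal distribution with mean mu and std deviation sigma
  var z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

function mixture_pdf(x, weights, mu, sigma) {
  // weighted sum of component densities; weights are non-negative, sum to 1
  var density = 0;
  weights.forEach(function(w, i) {
    density += w * normal_pdf(x, mu[i], sigma[i]);
  });
  return density;
}

// a tight component near zero (small strokes) plus a broad component
// (occasional large jumps between words or characters)
var p = mixture_pdf(0, [0.8, 0.2], [0, 5], [0.5, 3]);
```

&lt;p&gt;A single Normal distribution would have to compromise between these two behaviours, while the mixture can give most of its weight to small strokes and still allow the occasional large jump.&lt;/p&gt;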

&lt;p&gt;When we take this probability distribution, and sample from this distribution to get the set of &lt;code class=&quot;highlighter-rouge&quot;&gt;(dx, dy, pen)&lt;/code&gt; values to determine what to draw next, we use the &lt;code class=&quot;highlighter-rouge&quot;&gt;temperature&lt;/code&gt; parameter to control the level of uncertainty of the model.  If the temperature parameter is very high, then we are more likely to obtain samples in less probable regions of the probability distribution.  If the temperature parameter is very low, or close to zero, then we will only obtain samples from the most probable parts of the distribution.&lt;/p&gt;
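&lt;p&gt;One common scheme for applying a temperature (assumed here for illustration; it is not necessarily the exact &lt;code class=&quot;highlighter-rouge&quot;&gt;sample&lt;/code&gt; implementation) is to sharpen or flatten the mixture weights, and to scale each component’s standard deviation:&lt;/p&gt;

```javascript
// Illustrative sketch of adjusting a mixture with a temperature parameter.
// This is one common scheme, not necessarily the actual sample() internals.
function adjust_temperature(weights, sigma, temperature) {
  // rescale the log-weights by the temperature and re-normalise (a softmax);
  // low temperature sharpens the weights, high temperature flattens them
  var logits = weights.map(function(w) { return Math.log(w) / temperature; });
  var max_logit = Math.max.apply(null, logits);
  var exps = logits.map(function(l) { return Math.exp(l - max_logit); });
  var total = exps.reduce(function(a, b) { return a + b; }, 0);
  var new_weights = exps.map(function(e) { return e / total; });
  // widen or narrow each component so samples spread out with temperature
  var new_sigma = sigma.map(function(s) { return s * Math.sqrt(temperature); });
  return [new_weights, new_sigma];
}
```

&lt;p&gt;At a temperature of 1 the distribution is unchanged.  As the temperature approaches zero, the most probable component dominates and its standard deviation shrinks, so sampling becomes nearly deterministic.&lt;/p&gt;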

&lt;p&gt;In the sketch below, you can visualise how the probability distribution changes as the temperature parameter is varied.  You can control the temperature parameter by dragging the top orange bar.&lt;/p&gt;

&lt;div id=&quot;sketch04&quot;&gt;&lt;/div&gt;
&lt;center&gt;&lt;i&gt;Visualise a Mixture Density Distribution by adjusting the Temperature.&lt;/i&gt;&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;For simplicity, the above demo simulates a mixture of twenty, one-dimensional normal distributions with a temperature parameter.  In the handwriting model, the probability distribution is a mixture of twenty, two-dimensional normal distributions.  In the next sketch, you can modify the temperature of the handwriting model while it is writing something, to see how the handwriting changes with varying temperatures.&lt;/p&gt;

&lt;div id=&quot;sketch03&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;When the temperature is kept low, the handwriting model becomes close to deterministic, so the handwriting is generally neater and more realistic.  Increasing the temperature increases the likelihood of sampling from less probable regions of the probability distribution, so the handwriting samples tend to be more funky and uncertain.&lt;/p&gt;

&lt;h2 id=&quot;extending-the-handwriting-demo&quot;&gt;Extending the Handwriting Demo&lt;/h2&gt;

&lt;p&gt;One of the more interesting aspects of combining machine learning with design is exploring the interaction between human and machine.  The typical machine learning framework + python stack makes it difficult to deploy truly interactive web applications, as they often require dedicated web services to be written on the server side to process user interaction on the client side.  The nice thing about Javascript frameworks such as &lt;a href=&quot;https://p5js.org/&quot;&gt;p5.js&lt;/a&gt; is that interactive programming can be done with ease, and the result can be deployed in a web browser without much effort.&lt;/p&gt;

&lt;div id=&quot;sketch05&quot;&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;

&lt;p&gt;A possible &lt;a target=&quot;_blank&quot; href=&quot;https://otoro.net/ml/rnn-tutorial/multi.html&quot;&gt;interactive extension&lt;/a&gt; we can build from the basic handwriting demo is to have the user interactively write some handwriting onto the screen, and when the user is idle, have the model continuously predict the rest of the handwriting sample.  Another &lt;a target=&quot;_blank&quot; href=&quot;https://otoro.net/ml/rnn-tutorial/predict.html&quot;&gt;extension&lt;/a&gt; we can build, similar to the one in the &lt;a href=&quot;https://distill.pub/2016/handwriting/&quot;&gt;distill.pub&lt;/a&gt; post, is to have the model sample multiple possible paths that follow the handwriting path created by the user.&lt;/p&gt;

&lt;div id=&quot;sketch06&quot;&gt;&lt;/div&gt;

&lt;p&gt;There are countless other possibilities one can explore with this model.  It would also be interesting to combine this model with more advanced frameworks such as &lt;a href=&quot;https://paperjs.org/&quot;&gt;paper.js&lt;/a&gt; or &lt;a href=&quot;https://bl.ocks.org/&quot;&gt;d3.js&lt;/a&gt; to generate better looking strokes.&lt;/p&gt;

&lt;h2 id=&quot;use-this-code&quot;&gt;Use this code!&lt;/h2&gt;

&lt;p&gt;If you are an artist or designer interested in machine learning, you can fork the &lt;a href=&quot;https://github.com/hardmaru/rnn-tutorial&quot;&gt;github repository&lt;/a&gt; containing the code used for this post, and use it to your liking.&lt;/p&gt;

&lt;p&gt;This post only scratches the surface of recurrent neural networks.  If you want to be more involved in the whole machine learning development process and train your own models, there are excellent &lt;a href=&quot;https://ml4a.github.io/&quot;&gt;resources&lt;/a&gt; to learn how to build models with &lt;a href=&quot;https://github.com/jtoy/awesome-tensorflow&quot;&gt;TensorFlow&lt;/a&gt; or &lt;a href=&quot;https://github.com/fchollet/keras-resources&quot;&gt;keras&lt;/a&gt;.  If you use &lt;a href=&quot;https://keras.io&quot;&gt;keras&lt;/a&gt; to build and train your models, there is even a tool called &lt;a href=&quot;https://github.com/transcranial/keras-js&quot;&gt;keras.js&lt;/a&gt; that allows you to export pre-trained models for use in the web browser, so you can build model interfaces like the Javascript handwriting model used in this post.  I haven’t personally used &lt;a href=&quot;https://github.com/transcranial/keras-js&quot;&gt;keras.js&lt;/a&gt;, since I found it fun to just write the handwriting model from scratch in Javascript.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This model has already been ported to &lt;a href=&quot;https://bl.ocks.org/dribnet/cd6ee08b7658e5c744307b44b438221f&quot;&gt;bl.ocks&lt;/a&gt;, and extended by a few people to do some &lt;a href=&quot;https://bl.ocks.org/dribnet/8284e82ecefefeb391298356d3ab6732&quot;&gt;very&lt;/a&gt;, &lt;a href=&quot;https://bl.ocks.org/dribnet/f27c6167fcf4157cd0da0d9d5d016aa7&quot;&gt;interesting&lt;/a&gt;, &lt;a href=&quot;https://naoyashiga.github.io/my-dying-message/&quot;&gt;things&lt;/a&gt;.&lt;/p&gt;

&lt;script src=&quot;/js/p5/handwriting/numjs.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/js/p5/handwriting/weights.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/js/p5/handwriting/model.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;/js/p5/handwriting/handwriting.js&quot;&gt;&lt;/script&gt;

</description>
        <pubDate>Sun, 01 Jan 2017 00:00:00 -0600</pubDate>
      </item>
    
  </channel>
</rss>
