My two favorite subjects are computational art and machine learning. Recently I have been spending a lot of time rebuilding images and scenes in Processing, a programming language for creative graphics. It was a lot of tedious work, so I wanted to see if I could train a computer to learn how to write Processing code that generates images. I decided to do this project using deep learning via the PyTorch framework.
Deep learning has accelerated computer vision, especially image analysis and more recently image generation, but not much work has been applied to generating working source code.
There are a few papers where people have tried to train end-to-end deep networks to generate code, but they have all had poor results, mostly because they made the problem domain too big.
My first step was to severely constrain the problem. Instead of training a network to write source code that does anything, I only train the model to write source code that draws pictures.
Using images as the output makes it possible for the model to compute a loss and learn from it.
When I first started this project, I had grand visions of what the model could do, but as I worked on it longer, I realized I had to cut many parts, such as teaching the model how to create its own variables. My hope is that this work can act as a stepping stone toward teaching machine learning algorithms how to do generalized coding, not just graphics programming. My goal was to see if I could get the model to generate simple code that produces graphics that look like South Park.
For any deep learning system to learn, the system must be differentiable: you must be able to calculate the gradient through the network so that it can adjust its weights via gradient descent. So how do you model the problem in a way that is differentiable, so the model can learn the correlation between the code it produces and the result of running that code? I had seen a model called Show and Tell (im2text / neural captioning) that, given an image, can generate a description of it, so I decided to start experimenting with that. The main issue I saw with this kind of model is that it compares the generated text with the text from the training data. I want my model to be optimized by comparing the generated image against the original image: the closer the generated image, the better the model.
I took 200k pairs of images and source code and passed each image through a CNN classifier to extract its high-level features. I then fed those features into an RNN along with the source code so the RNN could learn how to generate code (a rough sketch of this setup is below). I used the Processing language, specifically a dialect called Propane that runs on JRuby. I did not write my own DSL or put a layer between the model and the raw programming language; the model generates code that I can feed straight into the JRuby parser. One of the main reasons I chose JRuby instead of pure Java is that I wanted the model to learn the simplest grammar possible. Java has types, a convoluted syntax, and is very verbose. Ruby can have complex syntax as well, but I only used basic syntax for the model to learn, and Ruby doesn't require type declarations. This model could be trained on any programming language; an earlier version used Clojure/Lisp instead.
That said, I do want to get this whole system running in Lisp again, because Lisp is closer to the abstract syntax tree that the computer reads anyway.
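To make that setup concrete, here is a rough PyTorch sketch of the encoder/decoder pairing described above. It is not the exact project code: the ResNet-18 backbone, the class names, and the layer sizes are placeholder assumptions, but the overall shape (frozen pretrained CNN features prepended to an LSTM that emits code tokens) is the idea.

import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Pretrained CNN that turns an image into a fixed-size feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                        # images: (B, 3, H, W)
        with torch.no_grad():                         # keep the pretrained CNN frozen
            feats = self.backbone(images).flatten(1)
        return self.fc(feats)                         # (B, embed_size)

class CodeDecoder(nn.Module):
    """RNN that generates code tokens conditioned on the image features."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, tokens):             # tokens: (B, T) token ids
        # Prepend the image features as the first "word" of the sequence.
        inputs = torch.cat([img_feats.unsqueeze(1), self.embed(tokens)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                       # (B, T+1, vocab_size) logits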
For the model to learn, I compared the generated source code with the code I supplied, similar to the paper, but I used a cross-entropy loss instead. This means the model is learning the exact order of the source code I provided, which in turn means the training source code has to be written in a deterministic fashion. Some images might be coded from the top left to the bottom right, others from the center outwards, and how do you deal with overlapping shapes? Images can be generated in many different ways, and even the same image can be created by different code: you could use the point, line, or rect functions to create essentially the same picture.
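In code, that objective is ordinary teacher forcing: the decoder is fed the ground-truth tokens and cross-entropy penalizes every predicted next token. A minimal sketch, reusing the hypothetical ImageEncoder and CodeDecoder from above (vocabulary size, padding index, and learning rate are all made up for illustration):

criterion = nn.CrossEntropyLoss(ignore_index=0)        # assume token id 0 is padding
encoder = ImageEncoder(embed_size=256)
decoder = CodeDecoder(vocab_size=500, embed_size=256, hidden_size=512)
optimizer = torch.optim.Adam(
    list(encoder.fc.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(images, tokens):
    """images: (B, 3, H, W); tokens: (B, T) ids of the ground-truth code."""
    img_feats = encoder(images)
    logits = decoder(img_feats, tokens[:, :-1])        # logit at step t predicts token t
    loss = criterion(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()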
Below are some results. For simple images the generated code starts to resemble the target image, but for more complicated images it is totally off.
background(74,161,184)
no_stroke
triangle(w(0.03),h(0.01),w(0.95),h(0.04),w(0.72),h(0.01))
fill(20,170,191)
rect(w(0.66),h(0.06),w(0.72),w(0.72))
triangle(w(0.95),h(0.98),w(0.55),h(0.99),w(0.96),h(0.99))
background(117,13,251)
no_fill
fill(13,0,226)
line(w(0.99),h(0.09),w(0.96),h(0.96))
line(w(0.12),h(0.02),w(0.96),h(0.96))
fill(161,2,103)
line(w(0.14),h(0.62),w(0.86),h(0.03))
fill(11,251,119)
line(w(0.77),h(0.4),w(0.64),h(0.85))
fill(161,2,67)
line(w(0.14),h(0.62),w(0.73),h(0.06))
fill(11,251,148)
line(w(0.84),h(0.59),w(0.2),h(0.81))
fill(248,2,1)
line(w(0.14),h(0.76),w(0.41),h(0.2))
fill(248,2,1)
line(w(0.14),h(0.76),w(0.41),h(0.59))
fill(114,229,158)
line(w(0.84),h(0.96),w(0.73),h(0.92))
As I mentioned earlier, I wanted to compare image loss. I tried taking my RNN output and rendering the image so it could be compared to the source image. That's when I realized that this would "block" the gradient. It gets "blocked" because the image is rendered in a separate process from the model training, so after you generate the image you can't just insert it back into the training process; there is no way to derive the gradient through the renderer. So I decided to try a completely different architecture. I tried to train a model that, given source code, could generate the image. If that worked, I could then use this model as a "compiler" and as a final layer on my code generator, training two networks on the same data with different objective functions, sort of like a generative adversarial network but for code generation. I got a model working using the pixel-wise Euclidean distance between the images. The model was generating mostly garbage even though the loss was extremely low (0.005), but the images are interesting. I think it's because of the way I have been using RNNs.
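For reference, here is roughly what that second model, the "neural compiler", looks like as code: an RNN reads the code tokens and a couple of dense layers paint a small image, trained against the real rendering with a pixel-wise Euclidean (MSE) loss. The 64x64 output and the layer widths are arbitrary placeholders for the sketch.

class NeuralCompiler(nn.Module):
    """Maps a sequence of code tokens to an image, so the whole path stays differentiable."""
    def __init__(self, vocab_size, embed_size=128, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.to_image = nn.Sequential(
            nn.Linear(hidden_size, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64), nn.Sigmoid())   # RGB pixels in [0, 1]

    def forward(self, tokens):                         # tokens: (B, T)
        _, (h, _) = self.lstm(self.embed(tokens))      # take the final hidden state
        return self.to_image(h[-1]).view(-1, 3, 64, 64)

pixel_loss = nn.MSELoss()   # mean squared (Euclidean) pixel distance
# loss = pixel_loss(compiler(code_tokens), rendered_target_images)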
So I think plain RNNs are not good enough in this case; a completely new architecture needs to be used. I saw a couple of new papers come out recently where they tried similar experiments and ran into the same issues. They came up with completely different architectures to get around the problem, skipping RNNs and generating images iteratively in a reinforcement learning style.
I ended up training on a few other architectures, but they had worse results, so I’m not going to write them up.
While working on this project, it got me thinking about why we write code the way we do. When I write programs, I usually rewrite them several times. I start off with a basic, ugly, unoptimized system, and I constantly refactor it until I am satisfied with a combination of elegance, understanding, and performance. The process of writing code is a journey. Each version is written the way it is because, as I'm writing it, I'm obtaining new knowledge that allows me to simplify the code. If I were to see the "best" version of a piece of code before I had written all the other versions, I might not really like that "best" version. In my software I'm always looking for the right level of abstraction; it is very subjective, and everyone writes their code differently. Computers, on the other hand, are usually working towards a single objective, whether that is length of code, clarity, verbosity, or whatever. So when training these models, they will only output one style, based on their training data. It would be cool to experiment with "style transfer" but for coding style. It would be interesting to see if it is possible to teach a computer to write code the way we do.
I see all kinds of applications for this kind of technology, such as helping developers get prototypes out faster, teaching developers how to code graphics, all kinds of interactive media, etc.
I would love to continue this, and I plan to, but I need to come up with a completely new architecture, and I would rather work on it with someone else.
I ran into lots of dead ends testing out new architectures and components.
Designing brand new architectures is still difficult these days; there are a lot of components, layers, and parameters to tune. That's why they say networks are often designed with graduate student descent, i.e. tons of cheap grad students.
There are so many things I could potentially test: attention networks, spatial transformer networks, bidirectional RNNs, deconvolution layers, etc. I'm most interested in trying out some form of reinforcement learning.
Also, for most of the project I did all the work on my own; I think it's much more fun and saner if you can partner with someone on a project so you can at least bounce ideas off each other.
Here is the source code of the current work
If any of this is interesting, please reach out.