Building Our Own Version of AlphaGo Zero

At Rossum, we are building artificial intelligence for understanding documents. Our main line of attack is supervised machine learning, the most efficient approach to making neural networks achieve the highest accuracy. However, this setup requires very detailed training data, and that’s why we are also pursuing less direct approaches, e.g. based on generative adversarial networks or advanced reinforcement learning.

Petr Baudiš’ Story: From Go to Neural Networks and an Unexpected Reunion

Years ago, I focused my own scientific efforts on reinforcement learning, in particular on making computers play the board game of Go. I had become captivated by the game shortly before, and it was considered one of the toughest challenges in AI that we could tackle successfully at the time.

I wrote Pachi, the then-strongest open source program, and later followed up with an educational program, Michi (the essential state-of-the-art algorithms in 550 lines of Python code). A few months later, Google’s DeepMind announced several big breakthroughs in the application of neural networks to the game with their program AlphaGo; meanwhile, I moved on to neural network research in the area of natural language processing.

DeepMind stirred the artificial intelligence community again just a month ago when the team announced a new version, AlphaGo Zero. It is extraordinary because this time their neural networks learned Go completely from scratch with no human knowledge (no supervised training or hand-crafted features), while requiring much less computation than before.

“Mastering the game of Go without human knowledge”
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. (…) Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

I read the Nature paper the same evening it was published, and I was excited! Given my Computer Go experience and later work in neural networks, it seemed it wouldn’t take a lot of effort to reproduce the principle of AlphaGo Zero: the basis of the algorithm was much simpler than the original AlphaGo, a single elegant network trained in a loop with no extra heuristics to worry about.

Well, when I went to bed at 5am that night, the first version of a new Go program, Nochi, was grinding through games on Rossum’s GPU cluster.

DeepMind’s Story: From AlphaGo to AlphaGo Zero

The basis of AlphaGo Zero is simple: a single neural network that simultaneously evaluates positions and suggests follow-up moves to explore, combined with the classic Monte Carlo Tree Search algorithm, which builds the game tree, explores follow-ups and finds counter-moves. In this case, though, the search relies only on the neural network instead of playing many random game simulations (which was the case for all previous strong Go programs).
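To make the interplay between the network and the tree search more concrete, here is a minimal sketch of the move-selection step, assuming the PUCT rule described in the AlphaGo Zero paper. The names `Node`, `expand` and `select_move`, and the constant `c_puct=1.5`, are illustrative assumptions and are not taken from Nochi or from DeepMind's code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                 # P(s, a): move probability suggested by the network
    visits: int = 0              # N(s, a): how often the search tried this move
    value_sum: float = 0.0       # W(s, a): sum of network evaluations backed up here
    children: dict = field(default_factory=dict)  # move -> Node

    def mean_value(self):
        # Q(s, a); unvisited moves default to 0.0 (an assumption of this sketch)
        return self.value_sum / self.visits if self.visits else 0.0


def expand(node, move_priors):
    """Attach one child per legal move, using the network's suggested priors."""
    for move, prior in move_priors.items():
        node.children[move] = Node(prior=prior)


def select_move(node, c_puct=1.5):
    """Pick the child maximizing Q + U (the PUCT rule)."""
    total_visits = sum(child.visits for child in node.children.values())

    def puct_score(move):
        child = node.children[move]
        # Exploration bonus: large for moves the network likes but the
        # search has rarely visited; it shrinks as visits accumulate.
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.mean_value() + exploration

    return max(node.children, key=puct_score)
```

The point of the sketch is that no random playouts appear anywhere: the priors steer which moves get explored, and the backed-up evaluations (accumulated in `value_sum`) decide which moves keep being explored.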

Then just start with a completely random neural network that predicts only pure chaos, and let it play many games against itself in a loop, again and again. The neural network is trained on the fly based on which situations it predicted correctly and where it guessed wrong: a reinforcement learning policy is built up. Over time, order emerges from chaos and the program plays better with each hour spent in Google’s TPU cluster.
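As a rough illustration of what "trained on the fly" can mean, the sketch below turns one finished self-play game into training examples, assuming the targets described in the AlphaGo Zero paper: the search's visit counts become the policy target and the final result becomes the value target. The function names and the record layout are hypothetical, not Nochi's actual data format.

```python
def visit_counts_to_policy(visit_counts, temperature=1.0):
    """Normalize search visit counts into a policy target: pi(a) ~ N(a)^(1/T)."""
    powered = {move: count ** (1.0 / temperature)
               for move, count in visit_counts.items()}
    total = sum(powered.values())
    return {move: weight / total for move, weight in powered.items()}


def make_training_examples(game_record, winner):
    """Convert one self-play game into (position, policy_target, value_target).

    game_record: list of (position, player_to_move, visit_counts) tuples,
                 one per move played (hypothetical layout).
    winner:      the player who won the game, in the same encoding as
                 player_to_move (e.g. +1 for Black, -1 for White).
    """
    examples = []
    for position, player_to_move, visit_counts in game_record:
        # The value target is the final outcome seen from the side to move.
        value_target = 1.0 if player_to_move == winner else -1.0
        examples.append((position,
                         visit_counts_to_policy(visit_counts),
                         value_target))
    return examples
```

Each example tells the network two things at once: which moves the search ended up preferring in that position, and whether the side to move eventually won. Repeating this over many games is the loop described above.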

It’s pretty amazing that learning to play Go from first principles to a super-human level was actually much quicker than training the original AlphaGo. It’s also surprising that the strategies discovered are actually very similar to what humans developed over thousands of years: we were on the right track!

When seeing how AlphaGo compares with AlphaGo Zero, it’s easy to pinpoint three main advancements that contributed to AlphaGo Zero:

  • Not basing the training on game records of human games.
  • A single simple neural network replacing the complex interlock of two neural networks used in the original AlphaGo.
  • Residual Units (ResNet-like) in the convolutional neural network that is used for Go board evaluation.
Deep Residual Learning for Image Recognition (arxiv)

The last point leads to a general recommendation: if you are using pre-ResNet convolutional neural networks for visual tasks, consider upgrading if accuracy matters! At Rossum, we have consistently seen an uptick in accuracy in all tasks where we did this, just as the AlphaGo team did.
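For readers who have not met residual units before, the core idea fits in a few lines. This is a deliberately tiny pure-Python sketch on plain lists, assuming the standard formulation from the ResNet paper (output = ReLU(x + F(x))); in a real network, `transform` would be a small stack of convolution and normalization layers, and the names here are illustrative only.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_unit(x, transform):
    """Add the input back onto the transformed signal, then apply ReLU.

    transform stands in for the learned layers F(x); it must return a
    list of the same length as x.
    """
    fx = transform(x)
    return relu([xi + fi for xi, fi in zip(x, fx)])
```

If `transform` outputs zeros, the unit simply passes non-negative inputs through unchanged, so each unit only has to learn a correction on top of the identity. That is a common explanation for why deep stacks of such units are easier to train.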

Rossum’s Go Program: Nochi

My small Python program Michi contained an implementation of the Go rules and the Monte Carlo Tree Search algorithm, and used randomized game simulations for evaluation. This was ideal: it was enough to replace the randomized game simulations with a Keras-based neural network and add a “self-play” training loop to the program. And thus Nochi was born. (Of course, while it took one night to implement, that’s not to say we haven’t been debugging and tweaking it over the following weeks…)

But there’s a catch. AlphaGo Zero is much less demanding than the old AlphaGo, but running the same setup would still take 1700 GPU-years on ordinary hardware. (Think about Google’s computational capabilities and what they achieved with their Tensor Processing Units for a moment!)

Therefore, we made the setup easier on ourselves: instead of the full-scale 19×19 board, we train Nochi only on 7×7, the smallest sensible board. We also tweaked the original approach: a slightly different neural network architecture based on our experience with what works best at Rossum, and a much more aggressive training curriculum that makes sure no position seen during the self-played games goes to waste and the neural network converges as soon as possible.

With this setup, Nochi became the first AlphaGo Zero replication to reach the level of the GNU Go baseline. (GNU Go is a classical intermediate-level program that’s popular for benchmarking other algorithms.) Moreover, Nochi’s level improves with the time allocated per move, which suggests that the neural network didn’t just memorize games but learned to generalize and figure out abstract tactics and strategy. And the best part? Nochi is open source on GitHub, and still just a tiny Python program that anyone can learn from.

Several other efforts to replicate the success of AlphaGo Zero are now underway, e.g. Leela Zero and Odin Zero. After all, the world still needs super-human Go-playing software that anyone can install and learn from! But we will be rooting from the sidelines: our full attention again belongs to documents and our vision of eliminating all manual data entry from the world.
