Founder Interview with Chief AI Architect Petr Baudiš
In our second Founder Interview blog, I sit down with Petr Baudiš, Chief AI Architect of Rossum, on our terrace on a surprisingly warm late autumn afternoon. We discuss growing up programming, starting Rossum, and the new technologies Rossum has been developing. Named one of the Financial Times’ New Europe Challengers in 2017, Petr has published many scientific papers, some of which were even referenced in AlphaGo’s Nature paper. Petr is a disruptor in the world of AI, and his work at Rossum shows exactly that. Read on and you’ll see why.
OG: Hi Petr, thanks for taking the time to sit down with me. Can you start by elaborating on your background? How long have you been programming?
PB: I have a pretty long background in programming. I’ve been programming pretty much for as long as I can remember, since I was about 7 years old. I did a lot of open source programming, and that is essentially how I learned to program. At 15, I started maintaining my first open source project, called ELinks. It was a web browser, but instead of running in a graphical environment with pictures, it ran in a text terminal. So if you connect to a server and all you have is a text connection where you enter commands, this browser still lets you browse web pages. That was my first open source project, and since then I have done a lot of other open source work.
The most notable work I did in this area was with Linus Torvalds. He decided that for the Linux kernel he needed a better way to track how people contributed to the kernel, track the code changes, and how those changes propagated among all the people working together on a single source code. So he went off and came up with Git, and essentially the same day that he announced Git on the Linux kernel mailing list, I noticed that email and sent my first contribution to Git. For a long time after, I was part of Git’s core development team.
I was always interested in Artificial Intelligence. While I was studying at University, I picked up the board game Go – the one with black and white stones on a grid, capturing territory. When I like a game, whether a computer game or a board game, I eventually start programming it. If it’s a computer game, I start sending patches and building stuff on top of it. So I started building an artificial intelligence to play Go. For a couple of years it was the best open source artificial intelligence for the game that you could get. There were some commercial programs you had to buy that were a little stronger than mine, but if you just wanted to download something for free, or experiment with it, my program was the one you would be using. This program ended up being one of the predecessors of Google DeepMind’s work on AlphaGo; it was the benchmark they used in their Nature paper to measure how good AlphaGo was, version by version, as they iterated.
OG: How did you become Co-Founder & Chief AI Architect of Rossum?
PB: I started my PhD and moved further into the field of Artificial Intelligence. I was deciding whether I should continue working on Go or do something else in the area; I didn’t immediately see the tricks that DeepMind would employ, so I didn’t see a way to easily make my program that strong. I was really interested in deep learning and neural networks, and I started to do a lot of experiments with them. My involvement in AI became essentially all about knowledge. How could we build the most helpful, strongest AI there is? A huge barrier is that humans have a lot of knowledge. Humans connect pieces of knowledge together in their brain – I wanted to see how we can replicate this in a computer.
How do we learn knowledge? We have some common sense knowledge, like ‘things fall down’ and ‘you can sit in a chair’. Encoding this knowledge might be nice if you are building a robot, but it is a long way from that to building anything useful with AI. I was thinking about what kind of knowledge to use, and how, to make something that is useful in the short term. My way of thinking was: I should be building something useful in the short term, but also something I can improve in the long term. If it is useful in the short term, it can earn the money to build it further.
Meanwhile, the other kind of knowledge is factual knowledge – like ‘What is the highest mountain in the world?’ or ‘What kind of protein is effective against some antibodies in some cell?’ It can be a simple factoid, but it can also be very domain specific, like a scientific fact. I thought that working with this factual knowledge could be a lot more useful in the short term, and then you realize that virtually all of this knowledge is stored in documents. So I focused very closely on documents – how knowledge is stored in them and how we can extract it. This was the general idea, but of course I wanted to build something practical.
I knew about IBM Watson – before it was this whole AI initiative, it was a Jeopardy player. Jeopardy is a television game where you get factoid questions and need to react to them super quickly with the correct answer. IBM Watson beat some human champions at this game, and IBM published an entire issue of a scientific journal packed with papers describing Watson’s individual components. I decided: let’s build our own open source version of IBM Watson. That project was called YodaQA. It started as more of an engineering project than a scientific one, but I focused heavily on the ‘documents’ part, and it turned out that to make meaningful progress I really needed to figure out a way to understand sentences and what they meant. I started to dive into deep neural networks for understanding text, sentences, and their meaning. I published some papers, and for a brief period Facebook was citing my neural networks as the result that beat theirs on some of the benchmarks. Of course, they quickly improved to overcome that again, but that’s how it goes in the scientific world.
At the same time, in the predoc office at our University, I had met two other great guys who were also doing their PhDs on topics related to documents [Rossum’s other Co-Founders Tomas Tunys and Tomas Gogar]. We talked about what we were doing and how to do it better, and what the long-term plans should be for our dissertations. Eventually, we decided we really wanted to build something together.
OG: What was it like when you were thinking of starting a company like Rossum?
PB: We wanted to build something advanced, really bleeding-edge. We were thinking: let’s take everything we have learned so far, everything we have been doing in our PhDs, put it all together, and find the most advanced product, the most advanced technology, that we can build with it.
That is how we actually came up with Rossum. We had this great research, mainly done at that time by my colleague Tomas, who figured out a way for neural networks to perceive documents not as a string of text but in a new way: he was working out how to represent layout and how the networks could perceive it. There were some pretty neat ideas he started with, and we collectively improved on them. I built the first implementation of the machine learning pipeline that processes documents and did the first extractions from documents. Roughly at the same time, we were going through Startup Yard, discussing with a lot of great business people what the best applications could be for this neat new way of extracting information from documents using layout. And one of the ideas that had been hanging on the post-it wall in our old office forever read ‘invoices’.
Invoices sounded like such a huge problem that we thought surely someone had built a product to solve it, but as it turned out during the business discussions, no one really had. There were, of course, plenty of products, but from our perspective they were all wrong and didn’t really work the way we thought they should. They required a long setup, were built on what we regarded as essentially obsolete technology, and were not drawing on state-of-the-art artificial intelligence, which is the area where we lived during our PhDs. When you are doing a PhD, writing a dissertation in a field, you need to explore the boundary, the frontier; we were very familiar with that. We asked: what is the most state-of-the-art approach we can leverage here? That is how Rossum’s technology, and the product, came to be.
OG: How would you describe Rossum’s AI and the unique features it offers?
PB: Our layout approach is unique. Most other approaches look at a page, convert the whole page to text, and then try to find the information in it. We do it the other way around: we first look at the page and figure out in which areas the information we are looking for is likely to occur, and only then do we focus on that area, convert it to text, and carefully read it letter by letter.
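To make that locate-then-read ordering concrete, here is a minimal sketch of such a two-stage pipeline. Everything in it – the function names, the field list, the stubbed models – is a hypothetical illustration of the idea Petr describes, not Rossum’s actual implementation:

```python
# Hypothetical locate-then-read sketch (illustrative names, not Rossum's API).
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    """A region of the page, in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int


def locate_fields(page_image) -> dict:
    """Stage 1 (stubbed): a computer-vision model would score page regions
    and propose where each field of interest probably sits."""
    return {
        "total_amount": Box(400, 700, 560, 730),
        "due_date": Box(60, 120, 220, 145),
    }


def ocr(region_pixels) -> str:
    """Placeholder for any OCR engine; in this sketch it returns dummy text."""
    return "<recognized text>"


def read_region(page_image, box: Box) -> str:
    """Stage 2: convert only the proposed region to text and read it
    carefully, letter by letter."""
    crop = page_image[box.top:box.bottom, box.left:box.right]
    return ocr(crop)


def extract(page_image) -> dict:
    """Locate first, read second -- the reverse of OCR-ing the whole page."""
    return {field: read_region(page_image, box)
            for field, box in locate_fields(page_image).items()}


# Example: run the pipeline on a blank 1000x800 "page".
print(extract(np.zeros((1000, 800))))
```

The key design point is that OCR only ever runs inside small, high-confidence regions, rather than over the whole page up front.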
In artificial intelligence you have two big fields – they aren’t the only ones, but they are among the most popular. One is natural language processing (NLP), which is all about text: understanding sentences, working with sequences of words, paragraphs, and long free-text documents, and finding information in them.
The other area is computer vision (CV), which is something you would use in self-driving cars, for instance. You have cameras, and the cameras perceive the car’s surroundings. The pictures are made of a lot of individual pixels, and the computer needs a way to connect those pixels together to figure out the shapes and what the shapes mean – it may be that some object is present there, maybe in the way of the car. That is what computer vision is all about. What we did was unusual in that we had this document, which sort of implies natural language processing and all those algorithms, and we said: if I look at the document like a human, I am not reading it letter by letter from one corner to the other. If it is a book or a newspaper, I pretty much read it sequentially, but if it is a business document like an invoice, I do something completely different. I skim over the page with my eyes, and I am definitely not reading anything consciously. If I need to know the amount, I intuitively zoom in to the area where the amount is. In my head I am not doing any sentence understanding; what I am doing is employing my visual cortex, the same neurons I would use when looking outside at buildings and streets – I am using computer vision to find the information.
The interesting thing about artificial intelligence in recent years is that you have all of these buzzwords like ‘deep learning’ and ‘neural networks’, and what they really mean is that we have figured out a way to mimic the human approach to common problems.
In the past, if you were building something to solve a problem, your algorithm – how you would do it in a computer – was typically very different from how a human would do it in their mind. With deep learning, we can actively think about how a human performs a process, what they are thinking while doing it, and draw inspiration from that to implement those algorithms with neural networks. That is essentially what we did for information extraction from documents.
OG: What are some of the biggest challenges that you have come across so far?
PB: There are a lot of challenges, of course. For engineers, programming and building the technology is the easy part – that is what we understand, that is what we are comfortable with. But setting up a successful business will always be hard work. Coming back to Rossum’s early beginnings, I think we also had a huge advantage in a sense: as three engineers starting a business, we had no preconceptions about the data capture market. We could go in and disrupt it completely, because we could work out the ideal product from first principles, with no historical baggage to carry.
I think for a lot of AI startups, the main technical challenge is data. A presentation by Andrej Karpathy, the Director of AI at Tesla, resonated with me; he spoke about the struggles Tesla faces – which overlap with Rossum’s own – in getting training data and working with that data to teach the neural networks to do a task correctly. In our case, we first simply wanted to find as many invoices as possible and put them in a single pile. But that is just the first 20% of what it means to “get data”.
To teach a neural network, you also need to give it the correct answer – much like children learning to solve basic math problems, in the end it needs to know whether it got it wrong or right. That means not just having a huge pile of invoices, but also having exact labels for what kind of information is on each invoice. When you start working on that is when you really go down the rabbit hole.
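As a toy illustration of why those labels matter – a made-up example, nothing like Rossum’s real pipeline – here is a tiny supervised learner that can only improve because each training token carries a human-provided answer:

```python
# Toy supervised learner: a logistic-regression "field spotter" that learns
# which token on an invoice is the total amount, purely from human labels.
import numpy as np

rng = np.random.default_rng(0)

# Each token gets invented features:
# [looks_like_money, near_word_'total', vertical_position_on_page]
X = rng.random((200, 3))

# The label is the exact answer an annotator would provide:
# 1 = this token is the invoice total, 0 = it is not.
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.5)).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability per token
    grad = p - y                            # error relative to the human label
    w -= lr * X.T @ grad / len(y)           # the model learns from mistakes...
    b -= lr * grad.mean()                   # ...which only exist because y exists

print("training accuracy:", ((p > 0.5) == y).mean())
```

Without the label vector `y`, the gradient step has nothing to push against – which is exactly why a pile of unlabeled invoices is only the first 20% of the work.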
It is difficult to figure out a universal format for invoice exchange because there are so many specifics that vary from invoice to invoice and supplier to supplier. This is why the data is hard to get for training the neural network: when you are labeling the data, you are hitting ambiguities all the time. On top of that, in the case of business documents the data is far more complex than your typical machine learning datasets, so you need annotators with a lot of expertise and a very robust process to catch mistakes. I’m very proud of the annotation team we have built; it took a long time to learn how to do it correctly, but now this expertise is priceless.
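One staple of such a robust process – sketched here generically; Petr does not describe Rossum’s actual workflow – is double annotation: label each document twice independently, and route any field where the two annotators disagree to an expert reviewer instead of into the training set:

```python
# Generic double-annotation check (illustrative, not Rossum's workflow).
def disagreements(ann_a: dict, ann_b: dict) -> set:
    """Return the field names where two independent annotations differ."""
    return {field for field in ann_a.keys() | ann_b.keys()
            if ann_a.get(field) != ann_b.get(field)}

first_pass = {"total_amount": "1,250.00", "due_date": "2017-11-30"}
second_pass = {"total_amount": "1250.00", "due_date": "2017-11-30"}

# Formatting ambiguities like '1,250.00' vs '1250.00' surface immediately
# and can be resolved by an expert before they pollute the training data.
print(disagreements(first_pass, second_pass))  # {'total_amount'}
```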
OG: Do you have any predictions you want to share?
PB: It’s a cliché, but everyone tells you that predicting the future is hard, and it is. For example, 10 years ago everyone predicted that no computer would be strong enough to beat a human at Go by now, but – surprise – there is AlphaGo, and it works great; its methods are enough to solve the problem. It is always a surprise when things work better and faster than anyone thought. And yet it is really hard to predict what the changes will be, because then again, we expected we would have solved a lot of other AI challenges long ago, and that has not happened yet.
One of the biggest challenges in Artificial Intelligence is still knowledge – what I set out to work on a couple of years ago. I think very little progress has been made on the full problem, though we have made a lot of progress on the sentence-understanding part of it. AI can understand much of the intent behind a question, but it gets really hard when the question asks about a fact you can’t answer using super basic common sense. For instance, AI can understand if you ask what the highest mountain in the world is, but there has been almost no progress toward an AI that understands a more complex question and then goes to a database to put the results together. Leveraging knowledge for AI – anything more complicated than a simple lookup of facts – is still a huge challenge.
OG: What do you see for the future of Rossum?
PB: At Rossum, the goal is to eliminate manual data entry from the world, because it is the short-term way to make the world better and to leverage our technology. We really do want to make the world better. To reach this goal we need to understand documents, even complicated ones, and save people this work. In the long term, we need to think about how we can leverage this understanding of documents to move back to the knowledge area – how to extract complex knowledge and how to work with it further. First we need to solve the manual data entry problem and learn how to reliably extract knowledge from all kinds of documents.
OG: Thank you, Petr, for taking the time to sit down and chat with me. I know everyone can’t wait to see what Rossum comes up with next!