Data capture for invoices ought to have been solved a long time ago! That's what most people think, especially if they’ve never tried to actually do it.
That's what we thought when we started talking to customers, looking for the ideal application of Rossum's machine vision technology. It is genuinely surprising how hard this problem actually is, and how big an advantage a human mind has compared to a fixed algorithm. That’s also the reason Rossum's approach stands out so much within this domain.
This is a special founder blogpost, in two parts written by the original minds behind Rossum's technology - Petr, Tomas and Tomas. We will walk you through the concrete limitations of the current OCR systems, why we built Rossum, which lets anyone capture data from invoices without manual capture setup, and how it achieves this.
Who are we? Standard nerds, albeit with many big accomplishments between us in machine learning, computer vision, and AI. Just about 2 years ago, we decided it was time to stop fiddling with AlphaGo and image recognition, and focus on one super-hard problem with a real impact on the lives of millions of people every day. Surprisingly enough, it turned out to be invoices. Here’s why:
Traditional Data Capture
Something we found out early about invoices was that the problem of reading invoices has been “solved” for decades, and yet the solutions don’t really work. Implementation is expensive and time-consuming, and even then, systems are prone to error such that they can never be fully automated.
The traditional approach to data capture from documents is to first generate a text layer from the document using an OCR (Optical Character Recognition) step, then recognize data fields using either image-based templates or text-based rules.
So far so good. This can actually work fine for documents that are scanned well and always have the same format without any variability, such as fixed forms or generated reports. The IRS and the postal system have had this technology forever. It can be a bit of a hassle to set up the recognition, but once you see the process through, you would be happy with the recognition reliability.
Rules for Variable Documents are Crazy
And yet… this approach breaks down badly once document variability becomes a factor, such as with invoices from suppliers. You’re likely to find that all your work in recognizing a particular kind of invoice has to be done all over again because of even a slight alteration of the format - alterations which happen constantly across industries.
The naive thing to do is to set up the recognition for each particular format of invoice, and people certainly try. But once you see the variability of invoices, you will realize how much of a losing battle this is.
Consider you’re Acme Technology Inc. You work with maybe 60 suppliers in 10 different countries. 60 companies means up to 60 distinct formats for invoices, any of which can change at any time. 10 different countries also means that there are potentially 10 legal standards that each company’s accounting offices also have to meet, and each of the 60 companies also has their own internal requirements for what invoices must include, and how they are formatted, so you can’t make all your suppliers follow your chosen standard. Some can’t, and others won’t.
Because invoices are often using mixed languages, you need to have OCR that recognizes all of them, and a set of text-based rules that take them into account. If you are using image based templates, then you need to make sure that anything scanned and submitted by a supplier is in the right format, isn’t rotated or blurred,, and nothing has actually changed from the original format your system is designed to detect.
And let us not consider what you do when an invoice contains unusual notations or line items your system isn’t prepared to handle. Also, keep in mind that this is happening with 60 different suppliers at once. Pretty soon the cost and complication of an OCR fails to justify its use. Companies go back to manual entry, or never adopt OCR to begin with.
The problem is hardly solved if no one uses the solution.
You need a lot of rules and templates, and eventually everyone gives up. This means in practice that even when OCR is adopted, only the suppliers producing the most invoices get the recognition implemented. Sometimes that’s 25% of invoice volume, sometimes 60%, but in most cases it falls very short from a complete solution. Moreover, you need to keep worrying as your suppliers change and their invoices change as well.
The problem is still not solved if you have to constantly worry that the solution will break suddenly and without prior warning.
The interesting thing that happens is that even suppliers using the same accounting software to generate invoices must almost always be handled separately - there is so much room for customization when generating invoices, and boy does everyone sure use that to its full potential. The crazy thing that doesn’t happen is that data capture users aren’t banding up to share supplier templates and examples with each other, so every project is done internally, from scratch. Maybe someone already did set up the rules for invoices from AWS or your plumber, but you will need to do it again anyway.
If you’re like us, coming from an Artificial Intelligence background, and not from within the traditional data capture mindset, then what we just described sounds pretty crazy and backwards. Yet this is the current situation, like it or not.
For large corporations, the better is often the enemy of the good. The bigger the organization, the higher the risk associated with changing these approaches. Because no solution has yet broken down the brick wall that OCR and dumb algorithms hit many years ago, there is little sense for large enterprises to invest heavily in improving their systems for tiny marginal improvements. If they throw away their old approach, it has to be really worth it. Effectiveness and reliability have to go from around 80% to 98% or more. A quantum leap is needed. That’s what we at Rossum quickly realized, and it is what we are now delivering to our clients.
Standard OCR is Just Not Good Enough - And Never Will Be
Let’s assume we go crazy and try the traditional recognition setup.
The trouble with image-based templates is that they are very sensitive to the scanning process, with zero flexibility in regards to document variability. Image based templates are the simplest possible approach, they’re effectively just saying: “this field is in this exact position on every document, or nothing works.”
Text-based rules give an initial illusion of flexibility - just bind a data field type to the obvious label phrasings. Surely there can’t be but a few ways of doing this. And thus have many engineers failed to appreciate a user’s ability to destroy our carefully laid plans.
Try to figure out good flexible rules for invoiced amounts, as shown in the images below - go ahead, we’ll wait.
Besides covering all the different phrasings, the non-obvious caveat is the false positives you could get - rules that are too universal would eagerly match at all sorts of wrong places and capture a different kind of information. Sub-Totals become Totals. Shipping becomes VAT. Dogs living with cats, mass hysteria!
This is why in the end, with sorrow in your defeat, you will restrict even text-based rules to each respective supplier. A fancy term to give this a guise of sophistication is “fingerprinting”.
The other problem with text-based rules is… that just like image-based templates, they are also sensitive to the scanning process! The rules match concrete text strings, and OCR is a noisy process. The hard truth is that OCR was originally developed for digitizing books and newspapers, and applying the exact same technology on business documents leads to all sorts of results, often ranging from interesting to funny (in a sad way).
First, OCR needs to detect that a text is present at a particular position at all. Invoices have an extremely complicated layout, and detecting all pieces of text on the page reliably may be a challenge the moment a slightest problem appears, ranging from too small font to a smear or an overlapping stamp.
Second, OCR makes letter-by-letter mistakes, especially in an unfamiliar setting. And “unfamiliar setting” may just mean reading a credit note rather than Shakespeare or a biology textbook. There aren’t many street names or dollar amounts in Shakespeare sonnets to be recognized, and the results follow correspondingly (i.e., not great).
Traditional OCR provider:
And because OCR is noisy, text rules can still fail - it is not just about transcribing the value of a field, but also precisely transcribing the whole text label of the field, so that your painstakingly built text rules have a chance to match.
The practical implications? The painstaking text rules implementation just gets more painstaking as variations covering the most common OCR mistakes start popping up in the rules. Our newest customer maintained 13,000 lines of rules amassed over a few years of operation. Budgets are destroyed, timelines the same, and still you’re stuck with less than okay results.
This is the way it works, it’s just the way it is. Just as travelling between Europe and the Americas would take weeks by ship before air travel, you would run only a single program at once in DOS. The world is using a technology, OCR, which was designed to digitize texts that were printed out in books, by professionals. It’s like trying to find things on the internet by guessing and checking domain names - which is actually something people used to have to do.
But with the recent advances in Artificial Intelligence, we can be more daring. Rather than look at the current fixed algorithms and fine-tune them further, we can take a step back and look from the opposite, radical perspective: why are computers so bad at this when humans can do the task really well?
In the last 5 years, thinking about how humans find information to teach computers the same approach suddenly isn’t a crazy notion - instead, it has become a proven strategy that works for automating routine tasks thanks to neural networks, deep learning and big datasets.
Self-Teaching: Humans Don’t Need Templates
Traditional OCR software that was built to digitize books and articles takes a completely sequential approach to reading a page - it just starts at the top left corner and goes line by line all the way down to the bottom right corner (or the opposite for Chinese or Hebrew). That’s fine - humans do the same thing when reading an article, more or less.
And yet a human can also look at a page of text and instantly understand *what it is.* A human can skim, and find specific information without reading everything. I know the difference between a string of random characters and the opening of Moby Dick. OCR doesn’t know that. To OCR, it’s just “Call me 1shm4€l.”
When we move to business documents, we see even more why that skill of self-teaching and self-directing one’s attention is so important. Humans do not read such documents from the beginning to the end, rather they are just looking for specific bits of information, skimming the invoice, darting their eyes back and forth and looking for key points. Reading the document precisely and letter by letter just isn’t necessary, and humans can go long way based just on a visual structure of the document - try it yourself:
But this is very much not how the traditional OCR software is reading business documents, it is reading them just like books. If they are matching rules to capture data, precise letter by letter reading is essential - “T0ta1 amourt” will just not match a rule that looks for the text “Total amount.” A human doesn’t care what letter it is, they just see whether this is the value they are looking for, and record it and move on without a second thought.
This is because humans are very good at self-teaching, developing what we know as “intuition,” the sum of all we know about our experiences that leads us to “know” things even when we don’t consciously know them. We can easily adjust our understanding of the whole document based on the fragments we skim through, and then decide to go back and look for other relevant information based on what we see. You could even derive from the usage of certain words or phrases, based on the context, what they mean and how they function. No traditional machine can do this or anything like it.
If I showed you a brand new format of invoice you’d never seen before, even in a language you aren’t familiar with, you would still be able, with no other outside inputs, to quickly analyze and digest the important information. You could teach yourself how to read it. Humans are just really good at that.
Next: The Rossum Approach
We looked at the fundamental limits of the traditional OCR systems regarding information extraction from business documents, and discussed how hard it is to push the accuracy further with this approach. At the same time, we saw how much more sense the human-like approach to data capture is, so the obvious question is whether we can replicate it with modern deep learning technology. And the answer is - yes! This is all that Rossum’s technology stack is about. We will delve into the details of that technology in the second half of this series, stay tuned.