Hear from Rossum’s Co-founder and Chief AI Architect Petr Baudis about our motivation behind the Rossum technology.
This is the second part of our special founders blog post on data capture technology and how Rossum represents a radically different approach to the whole problem of information extraction from business documents.
Last time we covered the fundamental issues of traditional OCR systems that stem from their machine-like approach, in contrast with the magical efficiency of the human mind at this task. Now it’s time to delve into how exactly we replicated the human approach using deep learning, and to show that it delivers as promised.
The Rossum approach: Imitating humans
Let’s take the same journey a business document travels in its own lifetime. From an idea in somebody’s head, to an idea in somebody else’s, and from one table of figures at one company, into another table of figures at another company.
When we write down a bunch of ideas and then send those ideas to another person for them to read and act upon, we format this data only to the extent that the formatting helps us to convey the key ideas. For example, a shopping list. We need not include orders to buy every item individually and in order (“Buy lettuce, then buy carrots”), nor the specific sizes or brands or locations in the store, nor the store itself. We may even convey complex ideas in shorthand: “salad fixings,” or “cheeses.” Items can be “fancy,” or “the cheap one.”
The person on the other end of this communication can easily pick up and sort out this very simple data set and construct around it many actions which must be taken to react to it. The human mind does this without much effort, and in this way we accomplish amazing coordination of tasks with few problems.
Machines… not so much.
Business information, such as that in an invoice, usually begins life in a table of figures, with notes or descriptions of what the figures are. That table is then used to generate a document, where the figure is put together with other figures and data from other tables, based on a very complex and verbose set of rules, and sent off to the recipient. From there, all the gathered bits of data must somehow make their way into a new set of structured tables for the purposes of the new organization that has received the data.
All the rules that went into constructing the document must be understood in advance by the recipient, or else nothing will be understood. Any deviation from this, even a comma in the wrong place, can break this intricate system and produce nothing but error messages. There is no room for improvisation and also no ability to recognize subtlety or consider alternatives.
What should be obvious on comparing these two processes is that data transmission between highly automated businesses creates a very tight bottleneck: data can only pass between them in a very specific and predictable way. No emojis. No “you’ll figure it out.”
What we’ve found out in the process of designing a new approach to business information is that humans process such business documents very differently from traditional OCR systems – instead of reading them from the beginning to end, letter by letter, they find information by looking at the document as a whole, taking in the layout, and only focusing on the key locations containing the important data.
We have broken down this approach into three major steps, which we imitate in our neural networks: 1) SKIM-READING, 2) DATA LOCALIZATION, 3) PRECISE READING.
1) SKIM-READING
When seeing a document page for the first time, the initial take-in represents a sudden rush of information on a very rough level. Humans call this a “glimpse,” and we use it to sort out patterns and potential meanings all the time. It’s akin to your ability to recognize a firetruck driving down a street, without analyzing any specific qualities that make it a firetruck. It just has “firetruckness.”
We initially notice the layout, but also the general structure of letters at various points of the page – where the amounts and dates are, which texts are large and small, if there is a number in a top right corner of the page, etc. We recognize “invoiceness.”
Next, to find out which amount is which and what dates to look for, we glance ever so briefly at the text labels next to the values to get the rough idea.
The British Dictionary offers a perfect word for this: skim-reading.
To replicate this kind of skim-reading within machine data capture, we use a very special kind of spatial OCR. Instead of converting the page to a stream of text, we determine the letter present on each individual spot of a page, building a space map of where each letter is. We can still recognize words by looking at places adjacent to each other, yet layout is the primary subject of representation rather than an afterthought.
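To make the idea of a layout-first character map more tangible, here is a minimal sketch. It is not Rossum’s actual implementation: the `detections` input, the grid representation, and the `build_char_map` and `words_in_row` helpers are all hypothetical names invented for illustration, and a real system would predict the characters from the page image rather than receive them as a list.

```python
from typing import List, Tuple

# Hypothetical per-spot detections: (row, column, character).
Detection = Tuple[int, int, str]


def build_char_map(detections: List[Detection], rows: int, cols: int) -> List[List[str]]:
    """Place each detected character at its spot; empty spots stay as spaces."""
    grid = [[" "] * cols for _ in range(rows)]
    for row, col, char in detections:
        grid[row][col] = char
    return grid


def words_in_row(grid: List[List[str]], row: int) -> List[Tuple[int, str]]:
    """Recover words by joining horizontally adjacent characters.

    Returns (start_column, word) pairs, so the position stays attached to the text.
    """
    words = []
    current, start = "", 0
    for col, char in enumerate(grid[row] + [" "]):  # trailing space flushes the last word
        if char != " ":
            if not current:
                start = col
            current += char
        elif current:
            words.append((start, current))
            current = ""
    return words


detections = [(0, 0, "T"), (0, 1, "o"), (0, 2, "t"), (0, 3, "a"), (0, 4, "l"),
              (0, 12, "4"), (0, 13, "2"), (0, 14, "."), (0, 15, "0"), (0, 16, "0")]
grid = build_char_map(detections, rows=1, cols=20)
print(words_in_row(grid, 0))  # [(0, 'Total'), (12, '42.00')]
```

The point of the sketch is that words fall out of adjacency, while the column offsets (0 and 12 here) remain available for reasoning about layout.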
Imagine words cut out of newspapers and magazines in the classic film ransom note. We find this easy to read, but more importantly, the visual grammar communicates something of the purpose of the note even without reading it.
In visual representations for humans, this effect is often used to communicate information that is non-textual. For example, a spy thriller may show a rush of data about a specific person racing across a computer screen: it gives you the general idea of what’s there in the database, but not so that you can really read it.
2) DATA LOCALIZATION
Numerous studies have analyzed how humans read structured documents by tracking eye focus and movements. An eyetracker can record what part of a page the reader sets their gaze on for how long, nicely documenting the human information intake process.
For example, researchers at the Chia-Yi University investigated the learning process when reading a textbook text accompanied by diagrams (paper; the figure on the left), while UX designers at TheLadders looked into how humans look at CVs (paper; the figure on the right).
The first impression is obvious – humans read structured documents in a very non-linear way, darting their eyes all around and focusing only on specific spots which they intuit would contain the key information. This is perhaps disturbing for anyone who would expect that a potential employer might carefully read their entire CV. Clearly that is not the case!
We took the same principle and built our technology around it – much like humans, our neural networks look at a document page holistically and use this global, if rough, view to identify label candidates (based on just layout, or short text areas corresponding to various data labels). These are the places for our AI to focus on and read carefully, and each of them is already associated with the type of information that could be found there.
Much like a human might check several possible places for a single piece of information, Rossum’s neural networks are eager to over-generate candidates – then they read them carefully and throw away the false ones when more information is available, based on confidence scores. This is why Rossum misses information so rarely.
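The over-generate-then-filter idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `Candidate` class, the `select_fields` function, the field names, and the 0.5 threshold are hypothetical and not taken from Rossum’s system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    field_type: str    # e.g. "total_amount"
    text: str          # value read at this spot
    confidence: float  # score in [0, 1] after careful reading


def select_fields(candidates: List[Candidate], threshold: float = 0.5) -> Dict[str, Candidate]:
    """Keep the most confident candidate per field type, dropping weak ones."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        if cand.confidence < threshold:
            continue  # false candidate: thrown away once it has been read
        current = best.get(cand.field_type)
        if current is None or cand.confidence > current.confidence:
            best[cand.field_type] = cand
    return best


# Deliberately over-generated: two spots per field type.
candidates = [
    Candidate("total_amount", "1,210.00", 0.93),
    Candidate("total_amount", "1,000.00", 0.41),  # actually the sub-total
    Candidate("date_issue", "2018-03-01", 0.88),
    Candidate("date_issue", "2018-03-15", 0.62),  # actually the due date
]
fields = select_fields(candidates)
print({name: cand.text for name, cand in fields.items()})
# {'total_amount': '1,210.00', 'date_issue': '2018-03-01'}
```

Generating too many candidates costs a little extra reading, while generating too few means a field can never be found; that trade-off is why erring on the eager side makes sense.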
An interesting phenomenon is worth noting – traditional rules are typically very template-specific, which is understandable if you think back to the variety of ways a total amount can be encoded, as shown in the first post of this series. The neural localization technology automatically learns something comparable to the traditional rules (albeit more flexible and scan-robust). But because it is based on deep learning (Haven’t heard about Deep Learning? Read more e.g. at the MIT Technology Review), it also acquires the property common to deep learning algorithms: the ability to generalize.
Image recognition in self-driving cars is able to recognize even car models it has never seen before, because it generalizes from many examples of previously seen cars. Speech recognition from Google can understand even a voice it has never heard before, because it learns the general form of human speech. And just like that, our field localization does not simply memorize all invoice templates it has seen before – it acquires a general understanding of a typical invoice structure, and therefore can detect most information even in invoices unlike any it has ever seen before (as long as they are not totally dissimilar to what we are used to).
3) PRECISE READING
Whereas the technical details of our unique skim-reading and localization approach represent the essence of Rossum’s technological secret sauce, the final capture step gives it the required robustness to seal the deal as today’s top data capture system.
When we have identified multiple focal points on the page, it is time to precisely read and evaluate the information. For that, we increase the precision of character location, carefully transcribe the final text, and assign it a confidence score.
If you were to draw a box around a piece of information, your eyes would at first glance at the middle of that field, then work out exactly where its boundaries are so that you can determine the precise edges. It should come as no surprise anymore that Rossum works the same way, with a dedicated neural network to determine these boundaries from the focal points.
Fig.: An isolated data field at the top, and raw output from our precise OCR at the bottom – the height of the yellow dots corresponds to a particular letter found at that spot.
Only at this point, with the precise area of the field already determined, do we come to the classic OCR task of precisely transcribing linear text from the picture! But at this point, we have a much easier time than traditional OCR systems. First, we know where to look, so we do not need to detect chunks of text in the page picture (a highly error-prone task by itself). Second, we already know what type of field we are reading, which makes a huge difference. Just as it is hard to transcribe human speech if you do not know the vocabulary, it is hard on bad scans to tell letters from each other if you do not know whether you are looking at a numeric field or a piece of text.
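Here is a small sketch of why knowing the field type helps. The per-position probabilities, the `ALPHABETS` table, and the `decode` function are assumptions made up for this example; they only illustrate the general principle of restricting the decoder to characters that are plausible for the field.

```python
from typing import Dict, List, Optional, Set

# Hypothetical per-position probabilities from a character recognizer
# looking at a blurry scan: "l" vs "1" and "O" vs "0" are nearly tied.
positions: List[Dict[str, float]] = [
    {"l": 0.52, "1": 0.48},
    {"O": 0.55, "0": 0.45},
    {"O": 0.51, "0": 0.49},
]

# Allowed characters per field type; None means "no restriction".
ALPHABETS: Dict[str, Optional[Set[str]]] = {
    "numeric": set("0123456789.,"),
    "text": None,
}


def decode(positions: List[Dict[str, float]], field_type: str) -> str:
    """Pick the most probable character at each position, limited to the field's alphabet."""
    allowed = ALPHABETS[field_type]
    out = []
    for probs in positions:
        options = {c: p for c, p in probs.items() if allowed is None or c in allowed}
        if not options:  # nothing allowed was proposed: fall back to the raw guess
            options = probs
        out.append(max(options, key=options.get))
    return "".join(out)


print(decode(positions, "text"))     # lOO
print(decode(positions, "numeric"))  # 100
```

The same blurry pixels decode to two different strings, and only the field type tells us which reading is sensible.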
Fig.: Probability of decoding of visually very similar characters.
Eventually, we determine the most trustworthy values for each field and assign them confidences, which may also determine whether they need further verification. We combine several methods – first, the neural networks themselves output numerical confidence scores. Next – and this is one of the few places where we take advantage of the fact that we are extracting invoices – we assign high confidences to amount fields if we can confirm that various equations between amounts are working out, and we are confident about supplier-specific information.
Third, we are able to check against a supplier’s history, and find out if we can trust these numbers because the operator already confirmed this same information for the supplier multiple times in the past. In effect, Rossum learns what to trust, and how to verify.
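As a rough sketch of how such signals might be combined, consider the example below. Everything in it is an assumption for illustration: the specific equation (subtotal + tax = total), the “close half the gap” update rule, the threshold of three past confirmations, and the 0.90 review cut-off are hypothetical and are not described in this post.

```python
from decimal import Decimal


def amounts_consistent(subtotal: str, tax: str, total: str, tolerance: str = "0.01") -> bool:
    """Check the assumed equation subtotal + tax == total, within a rounding tolerance."""
    difference = Decimal(subtotal) + Decimal(tax) - Decimal(total)
    return abs(difference) <= Decimal(tolerance)


def combined_confidence(network_score: float, equation_ok: bool, past_confirmations: int) -> float:
    """Start from the network's score and close half the gap to 1.0 per passed check.

    Both the update rule and the threshold of 3 confirmations are illustrative only.
    """
    score = network_score
    if equation_ok:
        score += (1.0 - score) * 0.5
    if past_confirmations >= 3:
        score += (1.0 - score) * 0.5
    return score


equation_ok = amounts_consistent("1000.00", "210.00", "1210.00")
score = combined_confidence(0.80, equation_ok, past_confirmations=5)
needs_review = score < 0.90  # hypothetical cut-off for sending a field to an operator
print(equation_ok, round(score, 2), needs_review)
```

Using `Decimal` rather than floats for the amounts avoids spurious mismatches from binary rounding when checking the equation.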
Does it work?
Cool ideas are worthless if they don’t work in practice, and a critical reader might as well ponder this right now.
The awesome part is, this technology works great!
In Rossum’s products as a whole, we are mainly looking at how much we speed up our users in their data entry tasks. The human-like AI technology we covered is an integral part of this value, as it brings high extraction accuracy and automation to the table.
The story for a new user is that they start at a base accuracy between 80% and 90% per field, as they take advantage of the universal model that has seen a diverse range of invoices; but then it can adapt further for the particular suppliers of the user, routinely reaching over 95% accuracy over the course of roughly a month of regular usage*. (Yes, just regular usage from the user perspective – we weren’t kidding about the “no setup” part before.)
* The continuous adaptation is not rolled out for all customers yet, but just talk to us if you would like to experience this effect. It typically kicks in after the first few hundred invoices.
Of course, as you may have read recently, or as you may know if you’re a data scientist, the system is only as good as its initial assumptions. Though we don’t deal with issues of “bias” in the sense of gender or race, we do deal with the biases of culture and historical practice that we inherit from the datasets we have used.
Like a human growing wiser, Rossum knows best what it has been taught. That means that today, certain differences in best practices will cause less initial certainty for new users of Rossum. Rossum will work slightly faster for a UK service invoice than for an invoice in English from East Asia, or a Swedish commercial invoice.
Still, the nice thing about a neural network is that it never gets old and it never has to sleep. It keeps on learning all the time. Because Rossum is constantly exposed to a broader and deeper set of data, it is also able to quickly teach itself about previously mysterious nuances. This virtuous cycle of learning will mean that each customer will both benefit from and contribute to Rossum’s growth in understanding.
Is it better than other OCR solutions on the market? There is no industry-wide accuracy benchmark for setupless data capture, so we can’t put percentages side-by-side at the moment – simply because no one ever pondered the concept of capturing business data without manual setup. Anything we tried was simply not on a comparable scale at all (capturing a minority of fields on a minority of invoices, unlike us), so all we have to go by are our amazed customers.
We are calling out any competition, though. Feel free to use our public API, even – anyone can sign up and start using it right away, without any hassle along the way. This is something barely anyone else dares to do, mostly hiding behind the veils of contact forms and on-demand demo setups. So don’t just take our word for it – go try out the live demo now.
Take the journey with us
We have watched the journey of a structured document through the human mind, and through Rossum’s technology stack. Now, we would like to invite you to take the journey of setupless data capture together.
A great technology is not all that makes a great product. Unlike some vendors, we are not going to tell you fairy-tales about 100% extraction accuracy – for the cases where it’s needed, we have built a great verification interface that can completely take over the invoice entry role in any system. However, on the day we reach 100% accuracy all the time, for everything, you will have been put out of a job, along with us. We think that day is still very far off.
Our goal today is not replacing people, but speeding up the human operators as much as possible, and giving businesses the flexibility and reliability they need to do more for their customers, faster, and better. Our whole user experience is optimized towards that, and that’s how we measure success. And in the spirit of “setupless,” of course it is in the cloud, and in your web browser anywhere. But talking more about the philosophy of our user interface is a matter for another blog post.
One More Thing
We have talked about invoices a lot in this series. But our technology is not invoice-specific, just as human brains do not have a specialized invoice cortex. Accountants have not had the time to evolve this yet.
Our vision is to eliminate all manual data entry in the world, so besides having a ready-made product for invoices, we are currently looking for the first adopters of our technology for other kinds of business documents. Imagine what you could do if you never had to read an invoice again, or a purchase order, or a car registration form, or any document which is mostly dominated by a general format. That’s what Rossum wants to offer you, for anything you need it to do.
What about that kind of journey? Write to us at email@example.com.
Data field capture for invoices ought to have been solved a long time ago! That’s what most people think, especially if they’ve never tried to actually do it.
That’s what we thought when we started talking to customers, looking for the ideal application of Rossum’s machine vision technology. It is genuinely surprising how hard this problem actually is, and how big an advantage a human mind has compared to a fixed algorithm. That’s also the reason Rossum’s approach stands out so much within this domain.
This is a special founders blog post in two parts, written by the original minds behind Rossum’s technology – Petr, Tomas and Tomas. We will walk you through the concrete limitations of current OCR systems, why we built Rossum, which lets anyone capture data from invoices without manual capture setup, and how it achieves this.
Who are we? Standard nerds, albeit with many big accomplishments between us in machine learning, computer vision, and AI. Just about 2 years ago, we decided it was time to stop fiddling with AlphaGo and image recognition, and focus on one super-hard problem with a real impact on the lives of millions of people every day. Surprisingly enough, it turned out to be invoices. Here’s why:
Traditional Data Capture
Something we found out early about invoices was that the problem of reading invoices has been “solved” for decades, and yet the solutions don’t really work. Implementation is expensive and time-consuming, and even then, systems are prone to error such that they can never be fully automated.
The traditional approach to data capture from documents is to first generate a text layer from the document using an OCR (Optical Character Recognition) step, then recognize data fields using either image-based templates or text-based rules.
So far so good. This can actually work fine for documents that are scanned well and always have the same format without any variability, such as fixed forms or generated reports. The IRS and the postal system have had this technology forever. It can be a bit of a hassle to set up the recognition, but once you see the process through, you would be happy with the recognition reliability.
Rules for Variable Documents are Crazy
And yet… this approach breaks down badly once document variability becomes a factor, such as with invoices from suppliers. You’re likely to find that all your work in recognizing a particular kind of invoice has to be done all over again because of even a slight alteration of the format – alterations which happen constantly across industries.
The naive thing to do is to set up the recognition for each particular format of invoice, and people certainly try. But once you see the variability of invoices, you will realize how much of a losing battle this is.
Consider you’re Acme Technology Inc. You work with maybe 60 suppliers in 10 different countries. 60 companies means up to 60 distinct formats for invoices, any of which can change at any time. 10 different countries also means that there are potentially 10 legal standards that each company’s accounting offices also have to meet, and each of the 60 companies also has their own internal requirements for what invoices must include, and how they are formatted, so you can’t make all your suppliers follow your chosen standard. Some can’t, and others won’t.
Because invoices often use mixed languages, you need to have OCR that recognizes all of them, and a set of text-based rules that take them into account. If you are using image-based templates, then you need to make sure that anything scanned and submitted by a supplier is in the right format, isn’t rotated or blurred, and that nothing has actually changed from the original format your system is designed to detect.
And let us not consider what you do when an invoice contains unusual notations or line items your system isn’t prepared to handle. Also, keep in mind that this is happening with 60 different suppliers at once. Pretty soon, the cost and complication of an OCR setup can no longer be justified. Companies go back to manual entry, or never adopt OCR to begin with.
The problem is hardly solved if no one uses the solution.
You need a lot of rules and templates, and eventually everyone gives up. This means in practice that even when OCR is adopted, only the suppliers producing the most invoices get the recognition implemented. Sometimes that’s 25% of invoice volume, sometimes 60%, but in most cases it falls well short of a complete solution. Moreover, you need to keep worrying as your suppliers change and their invoices change as well.
The problem is still not solved if you have to constantly worry that the solution will break suddenly and without prior warning.
The interesting thing that happens is that even suppliers using the same accounting software to generate invoices must almost always be handled separately – there is so much room for customization when generating invoices, and boy, does everyone use that to its full potential. The crazy thing is that data capture users aren’t banding together to share supplier templates and examples with each other, so every project is done internally, from scratch. Maybe someone already did set up the rules for invoices from AWS or your plumber, but you will need to do it again anyway.
If you’re like us, coming from an Artificial Intelligence background, and not from within the traditional data capture mindset, then what we just described sounds pretty crazy and backwards. Yet this is the current situation, like it or not.
For large corporations, the better is often the enemy of the good. The bigger the organization, the higher the risk associated with changing these approaches. Because no solution has yet broken down the brick wall that OCR and dumb algorithms hit many years ago, there is little sense for large enterprises to invest heavily in improving their systems for tiny marginal improvements. If they throw away their old approach, it has to be really worth it. Effectiveness and reliability have to go from around 80% to 98% or more. A quantum leap is needed. That’s what we at Rossum quickly realized, and it is what we are now delivering to our clients.
Standard OCR is Just Not Good Enough – And Never Will Be
Let’s assume we go crazy and try the traditional recognition setup.
The trouble with image-based templates is that they are very sensitive to the scanning process, with zero flexibility with regard to document variability. Image-based templates are the simplest possible approach; they’re effectively just saying: “this field is in this exact position on every document, or nothing works.”
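A toy example shows how brittle a fixed-position template is. The word list, the pixel coordinates, the `TOTAL_BOX` template, and the `read_from_template` helper are all hypothetical; no particular template product works exactly like this.

```python
from typing import List, Optional, Tuple

# An OCR word with its top-left position on the page, in pixels (made-up data).
Word = Tuple[str, int, int]  # (text, x, y)

# Template: the total is expected inside this fixed box (x0, y0, x1, y1).
TOTAL_BOX = (400, 700, 560, 730)


def read_from_template(words: List[Word], box: Tuple[int, int, int, int]) -> Optional[str]:
    """Return the first word whose position falls inside the template box, if any."""
    x0, y0, x1, y1 = box
    for text, x, y in words:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return text
    return None


clean_scan = [("Total", 300, 710), ("1,210.00", 420, 710)]
# The same page, fed into the scanner 40 pixels lower.
shifted_scan = [(text, x, y + 40) for text, x, y in clean_scan]

print(read_from_template(clean_scan, TOTAL_BOX))    # 1,210.00
print(read_from_template(shifted_scan, TOTAL_BOX))  # None
```

Nothing about the invoice changed between the two scans, yet the template silently loses the field.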
Text-based rules give an initial illusion of flexibility – just bind a data field type to the obvious label phrasings. Surely there can’t be but a few ways of doing this. And thus have many engineers failed to appreciate a user’s ability to destroy our carefully laid plans.
Try to figure out good flexible rules for invoiced amounts, as shown in the images below – go ahead, we’ll wait.
Besides covering all the different phrasings, the non-obvious caveat is the false positives you could get – rules that are too universal would eagerly match at all sorts of wrong places and capture a different kind of information. Sub-Totals become Totals. Shipping becomes VAT. Dogs living with cats, mass hysteria!
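The false-positive trap is easy to reproduce with a regular expression. The invoice text and both rules below are invented for this illustration.

```python
import re

# Made-up OCR output for the amounts block of an invoice.
ocr_text = """Sub-Total 1,000.00
VAT 21% 210.00
Total 1,210.00"""

# A "universal" rule: the word "total" followed by an amount, anywhere.
loose_rule = re.compile(r"total\s+([\d.,]+)", re.IGNORECASE)
# A stricter rule: "Total" must start the line.
strict_rule = re.compile(r"^total\s+([\d.,]+)", re.IGNORECASE | re.MULTILINE)

print(loose_rule.search(ocr_text).group(1))   # 1,000.00 (the sub-total, wrong)
print(strict_rule.search(ocr_text).group(1))  # 1,210.00
```

The stricter rule fixes this one invoice, but it would in turn miss labels such as “Amount total” or “Total due,” which is exactly how rule sets drift toward being supplier-specific.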
This is why in the end, with sorrow in your defeat, you will restrict even text-based rules to each respective supplier. A fancy term to give this a guise of sophistication is “fingerprinting”.
The other problem with text-based rules is… that just like image-based templates, they are also sensitive to the scanning process! The rules match concrete text strings, and OCR is a noisy process. The hard truth is that OCR was originally developed for digitizing books and newspapers, and applying the exact same technology on business documents leads to all sorts of results, often ranging from interesting to funny (in a sad way).
First, OCR needs to detect that a text is present at a particular position at all. Invoices have an extremely complicated layout, and detecting all pieces of text on the page reliably may be a challenge the moment the slightest problem appears, ranging from a font that is too small to a smear or an overlapping stamp.
Second, OCR makes letter-by-letter mistakes, especially in an unfamiliar setting. And “unfamiliar setting” may just mean reading a credit note rather than Shakespeare or a biology textbook. There aren’t many street names or dollar amounts in Shakespeare sonnets to be recognized, and the results follow correspondingly (i.e., not great).
And because OCR is noisy, text rules can still fail – it is not just about transcribing the value of a field, but also precisely transcribing the whole text label of the field, so that your painstakingly built text rules have a chance to match.
The practical implications? The painstaking text rules implementation just gets more painstaking as variations covering the most common OCR mistakes start popping up in the rules. Our newest customer maintained 13,000 lines of rules amassed over a few years of operation. Budgets are destroyed, timelines the same, and still you’re stuck with less than okay results.
This is the way it works; it’s just the way it is, in the same way that travelling between Europe and the Americas took weeks by ship before air travel, or that you could run only a single program at a time in DOS. The world is using a technology, OCR, which was designed to digitize texts printed in books by professionals. It’s like trying to find things on the internet by guessing and checking domain names – which is actually something people used to have to do.
But with the recent advances in Artificial Intelligence, we can be more daring. Rather than look at the current fixed algorithms and fine-tune them further, we can take a step back and look from the opposite, radical perspective: why are computers so bad at this when humans can do the task really well?
In the last 5 years, thinking about how humans find information to teach computers the same approach suddenly isn’t a crazy notion – instead, it has become a proven strategy that works for automating routine tasks thanks to neural networks, deep learning and big datasets.
Self-Teaching: Humans Don’t Need Templates
Traditional OCR software that was built to digitize books and articles takes a completely sequential approach to reading a page – it just starts at the top left corner and goes line by line all the way down to the bottom right corner (or the opposite for Chinese or Hebrew). That’s fine – humans do the same thing when reading an article, more or less.
And yet a human can also look at a page of text and instantly understand *what it is.* A human can skim, and find specific information without reading everything. I know the difference between a string of random characters and the opening of Moby Dick. OCR doesn’t know that. To OCR, it’s just “Call me 1shm4€l.”
When we move to business documents, we see even more why that skill of self-teaching and self-directing one’s attention is so important. Humans do not read such documents from beginning to end; rather, they are just looking for specific bits of information, skimming the invoice, darting their eyes back and forth and looking for key points. Reading the document precisely and letter by letter just isn’t necessary, and humans can go a long way based just on the visual structure of the document – try it yourself:
But this is very much not how traditional OCR software reads business documents; it reads them just like books. When such a system matches rules to capture data, precise letter-by-letter reading is essential – “T0ta1 amourt” will just not match a rule that looks for the text “Total amount.” A human doesn’t care about each individual letter; they just see whether this is the value they are looking for, record it, and move on without a second thought.
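The “T0ta1 amourt” failure takes only a few lines to demonstrate. The patched rule below is a hypothetical example of the kind of OCR-mistake variations that creep into text rules over time; it is not taken from any real rule set.

```python
import re

exact_rule = re.compile(r"Total amount")
# The same rule after patching in a few common OCR confusions (o/0, l/1, n/r).
patched_rule = re.compile(r"T[o0]ta[l1] amou[nr]t")

ocr_label = "T0ta1 amourt"

print(bool(exact_rule.search(ocr_label)))    # False
print(bool(patched_rule.search(ocr_label)))  # True
```

Each patch rescues one family of misreadings and makes the rule harder to read and maintain, and the next scan may well produce a confusion nobody has patched yet.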
This is because humans are very good at self-teaching, developing what we know as “intuition,” the sum of all we know about our experiences that leads us to “know” things even when we don’t consciously know them. We can easily adjust our understanding of the whole document based on the fragments we skim through, and then decide to go back and look for other relevant information based on what we see. You could even derive from the usage of certain words or phrases, based on the context, what they mean and how they function. No traditional machine can do this or anything like it.
If I showed you a brand new format of invoice you’d never seen before, even in a language you aren’t familiar with, you would still be able, with no other outside inputs, to quickly analyze and digest the important information. You could teach yourself how to read it. Humans are just really good at that.
Next: The Rossum Approach
We looked at the fundamental limits of traditional OCR systems regarding information extraction from business documents, and discussed how hard it is to push the accuracy further with this approach. At the same time, we saw how much more sense the human-like approach to data capture makes, so the obvious question is whether we can replicate it with modern deep learning technology. And the answer is – yes! This is what Rossum’s technology stack is all about. We will delve into the details of that technology in the second half of this series. Stay tuned.