The Future of Data Capture Systems (2/2): The Rossum Approach
This is the second part of our special founders blog post on data capture technology and how Rossum represents a radically different approach to the whole problem of information extraction from business documents.
Last time we covered the fundamental issues of the traditional OCR systems due to their machine-like approach, in contrast with the magical efficiency of human mind in this task. Now, it’s time to delve into how exactly we replicated the human approach using deep learning, and show that it certainly is delivering as promised.
The Rossum approach: Imitating humans
Let’s take the same journey a business document travels in its own lifetime. From an idea in somebody’s head, to an idea in somebody else’s, and from one table of figures at one company, into another table of figures at another company.
When we write down a bunch of ideas and then send those ideas to another person for them to read and act upon, we format this data only to the extent that the formatting helps us to convey the key ideas. For example, a shopping list. We need not include orders to buy every item individually and in order (“Buy lettuce, then buy carrots”), nor the specific sizes or brands or locations in the store, nor the store itself. We may even convey complex ideas in shorthand: “salad fixings,” or “cheeses.” Items can be “fancy,” or “the cheap one.”
The person on the other end of this communication can easily pick up and sort out this very simple data set and construct around it many actions which must be taken to react to it. The human mind does this without much effort, and in this way we accomplish amazing coordination of tasks with few problems.
Machines… not so much.
Business information, such as that in an invoice, usually begins life in a table of figures, with notes or descriptions of what the figures are. That table is then used to generate a document, where the figure is put together with other figures and data from other tables, based on a very complex and verbose set of rules, and sent off to the recipient. From there, all the gathered bits of data must somehow make their way into a new set of structured tables for the purposes of the new organization that has received the data.
All the rules that went into constructing the document must be understood in advance by the recipient, or else nothing will be understood. Any deviation from this, even a comma in the wrong place, can break this intricate system and produce nothing but error messages. There is no room for improvisation and also no ability to recognize subtlety or consider alternatives.
What should be obvious on comparing these two processes is that data transmission between highly automated businesses creates a very tight bottleneck: data can only pass between them in a very specific and predictable way. No emojis. No “you’ll figure it out.”
What we’ve found out in the process of designing a new approach to business information is that humans process such business documents very differently from traditional OCR systems – instead of reading them from the beginning to end, letter by letter, they find information by looking at the document as a whole, taking in the layout, and only focusing on the key locations containing the important data.
We have broken down this approach into three major steps, which we imitate in our neural networks: 1) SKIM-READING 2) DATA LOCALIZATION, 3) PRECISE READING.
When seeing a document page for the first time, the initial take-in represents a sudden rush of information on a very rough level. Humans call this a “glimpse,” and we use it to sort out patterns and potential meanings all the time. Its contiguous with your ability to recognize a firetruck driving down a street, without analyzing any specific qualities that make it a firetruck. It just has “firetruckness.”
We initially notice the layout, but also the general structure of letters at various points of the page – where the amounts and dates are, which texts are large and small, if there is a number in a top right corner of the page, etc. We recognize “invoiceness.”
Next, to find out which amount is which and what dates to look for, we glance ever so briefly at the text labels next to the values to get the rough idea.
The British Dictionary offers a perfect word for this:
To replicate this kind of skim-reading within machine data capture, we use a very special kind of spatial OCR. Instead of converting the page to a stream of text, we determine the letter present on each individual spot of a page, building a space map of where each letter is. We can still recognize words by looking at places adjacent to each other, yet layout is the primary subject of representation rather than an afterthought.
Imagine words cut out of newspapers and magazines in the classic film ransom note. We find this easy to read, but more importantly, the visual grammar communicates something of the purpose of the note even without reading it.
In visual representations for humans, this effect is often used to communicate information that is non-textual. For example, when you see a rush of data about a specific person race across a computer screen in a spy thriller, just giving you the general idea of what’s there in the database, but not so that you can really read it.
2) DATA LOCALIZATION
Numerous studies have analyzed how humans read structured documents by tracking eye focus and movements. An eyetracker can record what part of a page the reader sets their gaze on for how long, nicely documenting the human information intake process.
For example, researchers at the Chia-Yi University investigated the learning process when reading a textbook text accompanied by diagrams (paper; the figure on the top), while UX designers at TheLadders looked into how humans look at CVs (paper; the figure on the bottom).
The first impression is obvious – humans read structured documents in a very non-linear way, darting their eyes all around and focusing only on specific spots which they intuit would contain the key information. This is perhaps disturbing for anyone who would expect that a potential employer might carefully read their entire CV. Clearly that is not the case!
We took the same principle and built our technology around that – much like humans, our neural networks look at a document page holistically and using this global, even if rough view to identify label candidates (based on just layout, or short text areas corresponding to various data labels). These are the places for our AI to focus on and read carefully, and each of them is already associated with the type of information that could be found there.
Much like a human might check several possible places for a single piece of information, Rossum’s neural networks are eager to over-generate candidates – then they read them carefully and throw away the false ones when more information is available, based on confidence scores. This is why Rossum misses information so rarely.
An interesting phenomenon is worth noting – traditional rules are typically very template-specific, which is understandable if you think back to the variety of just how total amounts can be encoded in the first post of this series. The neural localization technology automatically learns something comparable to the traditional rules (albeit more flexible and scan-robust). But because it is based on deep learning (Haven’t heard about Deep Learning? Read more e.g. in this article), it also acquires the common property of deep learning algorithms.
Image recognition in self-driving cars is able to recognize even car models it has never seen before, because it generalizes from many examples of previously seen cars. Speech recognition from Google can understand even a voice it has never heard before, because it learns the general form of human speech. And just like that, our field localization does not simply memorize all invoice templates it has seen before – it acquires a general understanding of a typical invoice structure, and therefore can detect most information even in invoices unlike any it has ever seen before (as long as they are not totally dissimilar to what we are used to).
3) PRECISE READING
Whereas the technical details of our unique skim-reading and localization approach represent the essence of Rossum’s technological secret sauce, the final capture step gives it the required robustness to seal the deal as today’s top data capture system.
When we have identified multiple focal points on the page, it is time to precisely read and evaluate the information. For that, we increase the precision of character location, carefully transcribe the final text, and assign it a confidence score.
If you were to draw a box around a piece of information, your eyes would at first glance at the middle of that field, then work out exactly where its boundaries are so that you can determine the precise edges. It should come as no surprise anymore that Rossum works the same way, with a dedicated neural network to determine these boundaries from the focal points.
Fig.: An isolated data field at the top, and raw output from our precise OCR at the bottom – the height of the yellow dots corresponds to a particular letter found at that spot.
Only at this point, with the precise area of the field already determined, we come to the classic OCR task of precisely transcribing linear text from the picture! But at this point, we have much easier time than traditional OCR systems. First, we know where to look, so we do not need to detect chunks of text in the page picture (a highly error-prone task by itself). Second, we already know what type of field are we reading, which makes a huge difference. Just as it is hard to transcribe human speech if you do not know the vocabulary, it is hard on bad scans to tell letters from each other if you do not whether you are looking at a numeric field or a piece of text.
Fig.: Probability of decoding of visually very similar characters.
Eventually, we determine the most trustworthy values for each field and assign them confidences that may determine also whether they need further verification. We combine several methods – first, the neural network themselves output numerical confidence scores. Next – and this is one of the few places where we take advantage of the fact that we are extracting invoices – we assign high confidences to amount fields if we can confirm that various equations between amounts are working out, and we are confident about supplier specific information.
Third, we are able to check against a supplier’s history, and find out if we can trust these numbers because the operator already confirmed this same information for the supplier multiple times in the past. In effect, Rossum learns what to trust, and how to verify.
Does it work?
Cool ideas are worthless if they don’t work in practice, and a critical reader might as well ponder this right now.
The awesome part is, this technology works great!
In Rossum’s products as a whole, we are mainly looking at how much we speed up our users in their data entry tasks. The human-like AI technology we covered is an integral part of this value, as it brings high extraction accuracy and automation to the table.
The story for a new user is that they start at a base accuracy between 80% and 90% per field, as they take advantage of the universal model that has seen a diverse range of invoices; but then it can adapt further for the particular suppliers of the user, routinely reaching over 95% accuracy over the course of roughly a month of regular usage*. (Yes, just regular usage from the user perspective – we weren’t kidding about the “no setup” part before.)
The continuous adaptation is not rolled out for all customers yet, but just talk to us if you would like to experience this effect. It typically kicks in after the first few hundred invoices.
Of course, as you may have read recently, or as you may know if you’re a data scientist, the system is only as good as its initial assumptions. Though we don’t deal with issues of “bias,” in the sense of gender or race, we do deal in the biases of culture and historical practice that we inherit and which is inherited from the datasets we have used.
Like a human growing wiser, Rossum knows best what it has been taught. That means that today, certain differences in best practices will cause less initial certainty for new users of Rossum. Rossum will work slightly faster for a UK service invoice than for an invoice in English from East Asia, or a Swedish commercial invoice.
Still, the nice thing about a neural network is that it never gets old and it never has to sleep. It keeps on learning all the time. Because Rossum is constantly exposed to a broader and deeper set of data, it is also able to quickly teach itself about previously mysterious nuances. This virtuous cycle of learning will mean that each customer will both benefit from and contribute to Rossum’s growth in understanding.
Is it better than other OCR solutions on the market? There is no industry-wide accuracy benchmark for setupless data capture, so we can’t put percentages side-by-side at the moment – simply because no one ever pondered the concept of capturing business data without manual setup. Anything we tried was simply not on a comparable scale at all (capturing minority of fields on minority of invoices, unlike us), so all we have to go by are our amazed customers.
We are calling out any competition, though. Feel free to use our public API, even – anyone can sign up and start using it right away, without any hassle along the way. This is something barely anyone else dares to do, mostly hiding behind the veils of contact forms and on-demand demo setups. So don’t just take our word for it – go try out the live demo now.
Take the journey with us
We have watched the journey of a structured document through the human mind, and through Rossum’s technology stack. Now, we would like to invite you to take the journey of setupless data capture together.
A great technology is not all that makes a great product. Unlike some vendors, we are not going to tell you fairy-tales about 100% extraction accuracy – for the cases where it’s needed, we have built a great verification interface that can completely take over the invoice entry role in any system. However on the day we reach 100% accuracy all the time, for everything, then you will have been put out of a job, along with us. We think that day is still very far off.
Our goal today is not replacing people, but speeding up the human operators as much as possible, and giving businesses the flexibility and reliability they need to do more for their customers, faster, and better. Our whole user experience is optimized towards that, and that’s how we measure success. And in the spirit of “setupless,” of course it is in the cloud, and in your web browser anywhere. But talking more about the philosophy of our user interface is a matter for another blog post.
One More Thing
We have talked about invoices a lot in this series. But our technology is not invoice-specific, just as human brains do not have a specialized invoice cortex. Accountants have not had the time to evolve this yet.
Our vision is to eliminate all manual data entry in the world, so besides having a ready-made product for invoices, we are currently looking for the first adopters of our technology for other kinds of business documents. Imagine what you could do if you never had to read an invoice again, or a purchase order, or a car registration form, or any document which is mostly dominated by a general format. That’s what Rossum wants to offer you, for anything you need it to do.