Rossum’s Training Data
Every machine learning project needs training data. Gathering this data is pivotal to the project's success, and for many challenging problems it's not a simple matter either.
Well, Rossum’s Invoice Robot is no different! In order to teach our machines to recognize data fields on invoices, we need thousands of example invoices. These invoices also have to include the “right answers” we want from the robot – the value of each data field we are extracting, together with its precise location on the page. We use these invoices to train all capture phases – the data field localization as well as our tailor-made OCR stages.
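To make the idea of an invoice with its "right answers" more concrete, here is a minimal sketch of how one labeled example could be represented. This is not our actual internal format: the class names, the field name `total_amount`, the file name, and the pixel coordinate convention are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical location of a field: page index plus pixel coordinates."""
    page: int
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # A degenerate or inverted box counts as having no area.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


@dataclass(frozen=True)
class FieldAnnotation:
    """One 'right answer': the field's value and where it sits on the page."""
    name: str
    value: str
    box: BoundingBox


@dataclass
class AnnotatedInvoice:
    """An example invoice together with all of its labeled data fields."""
    source_file: str
    fields: List[FieldAnnotation] = field(default_factory=list)

    def is_fully_labeled(self, required: Iterable[str]) -> bool:
        # A field counts only if it has both a value and a non-empty location.
        present = {f.name for f in self.fields if f.value and f.box.area() > 0}
        return set(required) <= present


# Illustrative example with made-up values.
invoice = AnnotatedInvoice(
    source_file="example_invoice.pdf",
    fields=[
        FieldAnnotation(
            name="total_amount",
            value="1250.00",
            box=BoundingBox(page=0, left=400.0, top=700.0, right=520.0, bottom=730.0),
        )
    ],
)

print(invoice.is_fully_labeled(["total_amount"]))
print(invoice.is_fully_labeled(["total_amount", "due_date"]))
```

Keeping the value and the location together in one record reflects the point above: an invoice without field locations is only half of a training example, and a check like `is_fully_labeled` is one simple way to spot such gaps.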
Since training data is key to our project, we give it a lot of attention, both technically and business-wise. Unfortunately, though many of our partners offer us invoices for training, they rarely include the data field locations, so we have an internal team of annotators whose job is to label the invoices with all the necessary information using our internal tools.
We still need the invoices to train our system, though! In the countries where we already have partners using our service, this is easier, as we can use the invoices that were submitted to our cloud API. We even offer a feedback service where users can specifically mark invoices that were processed incorrectly.
However, we are rapidly expanding our service to more countries where we have yet to establish a user base. To bootstrap our efforts, we are mining the internet for invoices. It is surprising what one can find when digging hard within the publicly available data! We mainly use automated strategies, but we also work with external assistants who provide higher-quality data that can in turn be used as seeds for further automated searches. We hope to walk you through some of our automated strategies in follow-up blog posts.
(Sometimes, the results of our automated downloads can be a bit peculiar. There's a site called "Cars Below Invoice" with many pictures of cars, which we picked up as potential invoices. Our OCR was so befuddled by suddenly seeing a car seat instead of a document that we had to pause for a bit and train a good classifier to discern these from real invoices.)
Many machine learning projects have trouble getting good training data. This is a big challenge, for example, for chatbot efforts or for building robots that need to learn how to take actions that affect their environment. We are excited that we have, in fact, the perfect formula for the supervised machine learning scenario every ML practitioner dreams about: our crisp inputs and target labels give our neural networks a clear path to follow.