Our researchers have been hard at work solving line item extraction from invoices over the past few months. We have made several leaps forward that we will talk about in this post, and you can see how it works in our live demo right now! The rest of the post is aimed at our fellow data science geeks and is all about our performance indicators for line items.
Quick update about line items in Elis
We have a lot of results we are eager to share, as well as the story of our journey towards breathtaking table extraction capabilities – intense hackathons, exciting competition among multiple teams, and endless thorny issues around annotating something as complex and varied as invoice table data. But that's for another blog post.
The gist of the story, short & sweet, is that users of the Elis verification interface can now enjoy a dramatic quality improvement in the semi-automatic line item capture we offer, and fully automatic table extraction is now available in our Data Extraction API as an experimental feature.
One number to describe our precision
Current industry and quality standards demand evaluated measurements of the quality of service – the days when 'it works sometimes' was better than having no service at all are gone. Usually, in the data science field, the performance of an algorithm is evaluated using common standard metrics that can be computed when we know the algorithm's output (predictions) and the truth value (gold standard or annotations). A quick example would be to count the number of mistakes (referred to later in the text as 'bads') and the number of correct outputs (referred to as 'goods').
When we look at two tables, it might not be clear what exactly to count. So, what does the literature have to say? Usually, our problem is referred to as 'table detection' and 'table extraction': first we need to find the table, and then we need to extract the data from it. In the literature, there is a metric based on connections in the table that was first used at ICDAR 2013 to evaluate heuristic table detection and extraction from scientific texts. For table detection, a simple comparison of rectangle overlap is used, called 'intersection over union' (IoU), where a 'good' detection is one with an IoU score above 0.7, the argument being that we do not need to draw the rectangle around the table in an overly precise manner.
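The detection criterion is easy to sketch in code. Here is a minimal take on intersection over union for two axis-aligned rectangles given as (x1, y1, x2, y2) corner coordinates (the exact box format used in the ICDAR evaluation may differ):

```python
# Intersection over union for two axis-aligned rectangles (x1, y1, x2, y2).
# In the ICDAR 2013 setting, a detection counts as 'good' when IoU > 0.7.
def iou(a, b):
    # Corners of the overlap rectangle (may be empty).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Example: a detection shifted by 10 px on a 100x100 table.
print(iou((0, 0, 100, 100), (10, 10, 110, 110)))  # ~0.68, just below the 0.7 threshold
```

Note how forgiving the 0.7 threshold is: even a box shifted by 10% in both directions barely misses it.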
The ICDAR metric for table extraction is based on a simple idea – tables encode column-wise and row-wise relations, and we can count how well those relations are fulfilled. The metric is defined as follows: for each cell in the table, take the first non-empty (i.e., containing meaningful text) right-hand neighbor and the first non-empty bottom neighbor; record the two cells' texts, their distance and the direction, thus forming a record of four items. For the predicted and ground truth tables, we get two lists of records, which we can match against each other to count common records ('goods'), missed records and extra records. From these integer values, the data science field commonly calculates metrics such as precision, recall and f1 score – so far so good.
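Our reading of the adjacency-relation idea can be sketched in a few lines of Python. This is an illustration of the concept, not the reference ICDAR implementation, and the exact record format there may differ:

```python
# Sketch of an adjacency-relation metric for table extraction.
# A table is a list of rows; empty cells are "". Each relation records the
# two texts, the direction, and the number of cell steps between them.
from collections import Counter

def relations(table):
    recs = []
    n_rows, n_cols = len(table), len(table[0])
    for r in range(n_rows):
        for c in range(n_cols):
            if not table[r][c]:
                continue
            for dr, dc, direction in ((0, 1, "horizontal"), (1, 0, "vertical")):
                # Walk past empty cells to the first non-empty neighbor.
                rr, cc = r + dr, c + dc
                while 0 <= rr < n_rows and 0 <= cc < n_cols and not table[rr][cc]:
                    rr, cc = rr + dr, cc + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    dist = max(rr - r, cc - c)
                    recs.append((table[r][c], table[rr][cc], direction, dist))
    return Counter(recs)  # a multiset of four-item records

def score(predicted, truth):
    pred, gold = relations(predicted), relations(truth)
    good = sum((pred & gold).values())  # records common to both multisets
    precision = good / sum(pred.values())
    recall = good / sum(gold.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Matching the two record lists as multisets (via `Counter`) means a record is only counted as 'good' as many times as it appears in both tables.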
We have implemented the metric, used it a bit, and found that it is not the easiest to explain and also does not strictly follow how a human would judge the table prediction against the gold standard. So, we came up with our own metric that tries to mimic how a human would look at it. If I had two tables and had to say how similar they are, the first problem I would face is that the tables can be of different sizes. To reduce the problem, I would cut out rows and columns so that the tables end up the same (smaller) size. Which columns and rows should we cut? I would look for those with the craziest content – the ones that look like nothing in the second table (from an implementation point of view, the keyword is edit distance). We would then have two tables of the same size, and I could look at each cell and decide whether it matches or not.
These numbers (good and bad), together with the number of cells cut from each table (counting as a miss or an extra, depending on which table we cut them from), give nicely defined metrics. The algorithm for cutting out rows or columns does not need to be overly clever, because as a user, I would not spend a lot of time looking at all combinations – we want to capture that feeling in the metrics. Validation is done when we look at a table and its computed score and say 'that looks like 60%', nodding our heads in wise agreement. The nice property here is that if the prediction is perfect, we get a perfect score regardless of the algorithm used to decide which rows/columns to cut (and similarly, no table extraction at all would yield 0%).
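A greedy version of the cutting step might look like the sketch below. We use `difflib`'s similarity ratio as a convenient stand-in for edit distance; the actual implementation in Elis may choose rows/columns differently:

```python
# Greedy sketch of the size-matching step: while one table is wider than
# the other, drop the column whose concatenated text is least similar to
# any column of the other table. difflib's ratio stands in for edit distance.
from difflib import SequenceMatcher

def column_texts(table):
    # Concatenate each column's cells into one comparison string.
    return [" ".join(row[c] for row in table) for c in range(len(table[0]))]

def best_similarity(text, candidates):
    return max(SequenceMatcher(None, text, t).ratio() for t in candidates)

def cut_to_match(big, small):
    """Drop columns from `big` until both tables have the same width.

    Returns the trimmed table and the column indices cut (at removal time).
    """
    cut = []
    while len(big[0]) > len(small[0]):
        big_texts, small_texts = column_texts(big), column_texts(small)
        # The column that looks like nothing in the other table goes first.
        worst = min(range(len(big_texts)),
                    key=lambda c: best_similarity(big_texts[c], small_texts))
        cut.append(worst)
        big = [row[:worst] + row[worst + 1:] for row in big]
    return big, cut
```

The same idea applies symmetrically to rows; once the sizes match, the remaining cells are compared one by one.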
Let's take this table:
Let's assume it gets extracted as:
Then the number of rows (2) is the same in both tables, but the number of columns differs (the algorithm merged some columns), so we would cut the worst-matching columns from the original (specifically 3: quantity, tax, product total) and end up with statistics like this:
- good: 5 (CODE, 1234, DESCRIPTION, CIRCUIT BR..., ...)
- bad: 3 (ORDERED SRKAITEY, UNIT PRICETAX PRODUCT TOTAL, 10.0050.00)
and adding the number of cells we did cut out:
- miss: 6 (= 3 cols, each with 2 rows)
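Turning these counts into the usual metrics is straightforward. The exact formulas below are our assumption (the mapping is not spelled out above): we treat cut-out 'miss' cells as false negatives and would treat 'extra' cells as false positives:

```python
# Precision/recall/f1 from good/bad/miss/extra cell counts.
# Assumption: 'extra' cells hurt precision, 'miss' cells hurt recall.
def prf(good, bad, miss=0, extra=0):
    precision = good / (good + bad + extra)
    recall = good / (good + bad + miss)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The worked example: 5 good, 3 bad, 6 missed cells.
print(prf(good=5, bad=3, miss=6))  # (0.625, ~0.357, ~0.455)
```

With this reading, the merged-column extraction above would score roughly 45% – precise about what it did extract, but missing almost half the truth cells.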
We get a score for each table in each document, and in our development process we average these numbers (f1 scores) into one final score. This is called a 'macro metric', as opposed to a 'micro metric', which would first sum all the goods, bads, misses and extras across tables and only then compute precision, recall and f1 score from the totals.
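The difference between the two averaging styles is easy to demonstrate on made-up per-table counts (the precision/recall formulas from good/bad/miss/extra counts are our assumption, and the counts here are hypothetical):

```python
# Macro vs micro averaging over per-table (good, bad, miss, extra) counts.
def prf(good, bad, miss, extra):
    p = good / (good + bad + extra)
    r = good / (good + bad + miss)
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

# Hypothetical counts: one mediocre table, one perfect table.
tables = [(5, 3, 6, 0), (8, 0, 0, 0)]

# Macro: score each table separately, then average the f1 scores.
macro_f1 = sum(prf(*t)[2] for t in tables) / len(tables)

# Micro: sum the raw counts across tables, then score once.
totals = [sum(col) for col in zip(*tables)]
micro_f1 = prf(*totals)[2]

print(macro_f1)  # ~0.727
print(micro_f1)  # ~0.684
```

Macro averaging gives every table the same weight regardless of its size, which matches how a user judges each document on its own.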
We wanted to share our evaluation methodology in a way that could inspire others (not only data scientists), open a discussion and provide some links to the literature that inspired us. Keep your eyes peeled for more updates in the near future as we continue our quest to solve line item extraction from invoices, and feel free to reach out and comment on the methodology (or implement it as you wish).