From this article you will learn:
- What annotations are and why they are so important
- How to ensure high-quality annotations
- How to capture table data in Rossum
What are annotations, and why do you need them?
Annotations refer to all the captured data from a document. You can recognize them by the blue bounding boxes (b-boxes) that appear on the document after it is processed in Rossum.
To ensure that the AI engine will learn what data you need to capture, you need good-quality annotations.
How to ensure high-quality annotations?
You should always keep a few simple rules in mind if you want high-quality annotations. Following them will help optimize and improve the accuracy of your AI engine.
1. Keep the annotations consistent and precise
Consistency is essential for the engine to learn to capture the data correctly, so maintain it across your annotations. On documents with the same layout and/or from the same vendor, it is best to capture a value from the same place every time.
The bounding box borders should go around the data, not through it.
Also avoid including lines or other characters that do not belong to the value, and avoid capturing too much whitespace.
If you use a grid, place the separators (columns and rows) as close to the values to be extracted as possible.
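The rule that borders should go around the data, not through it, can be sketched as a simple geometric check. This is only an illustration: the `Box` structure, the coordinates, and the function names are assumptions made for this example and are not part of Rossum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical structure)."""
    left: float
    top: float
    right: float
    bottom: float


def intersects(a: Box, b: Box) -> bool:
    """True if the two rectangles share any interior area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


def words_cut_by(annotation: Box, words: list[Box]) -> list[Box]:
    """Return the word boxes that the annotation border passes through."""
    return [w for w in words
            if intersects(annotation, w) and not contains(annotation, w)]


# Two words on one text line (made-up coordinates).
words = [Box(10, 10, 50, 20), Box(55, 10, 90, 20)]
tight = Box(9, 9, 51, 21)    # surrounds the first word only
sloppy = Box(9, 9, 70, 21)   # border runs through the second word

print(words_cut_by(tight, words))   # no word is cut
print(words_cut_by(sloppy, words))  # the second word is cut
```

A bounding box that returns an empty list here goes around the data; any word in the result is one the border cuts through.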
2. Always annotate all values on the document
If there is data on the document with a corresponding field in the schema for its extraction, you should annotate it on each document where it is present. Please do so even if you do not need the value extracted for a particular vendor or in a specific case.
Amounts should also always be annotated, even if the value on the document is “0”.
3. Only annotate values that appear on the document
If there is a value on the invoice, it should be captured. Please do not manually enter a value that is not on the invoice. The engine cannot learn to populate values based on your manual input.
4. Annotate related data from the same location
It is always preferable to annotate logically connected data from locations that are close to each other rather than from two different parts of the document.
For example, it’s better to annotate the supplier name next to the supplier address rather than in another location far from it.
5. Annotate data from preferred locations
Data should be annotated from the preferred location whenever possible.
We recommend annotating the value the first time it appears on the document. For example, annotate vendor information in the header rather than the footer. Also, annotate values on the first page rather than on later pages.
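As a rough sketch of this "first appearance" rule, the preferred occurrence is the one on the earliest page and, within that page, the one closest to the top. The `Occurrence` type and its fields are hypothetical and exist only for this example.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One place where a value appears (hypothetical structure)."""
    page: int    # 1-based page number
    top: float   # distance from the top of the page
    text: str


def preferred_occurrence(occurrences: list[Occurrence]) -> Occurrence:
    """Pick the earliest page first, then the position nearest the top."""
    return min(occurrences, key=lambda o: (o.page, o.top))


candidates = [
    Occurrence(page=2, top=30.0, text="Acme Ltd"),   # header of page 2
    Occurrence(page=1, top=780.0, text="Acme Ltd"),  # footer of page 1
    Occurrence(page=1, top=40.0, text="Acme Ltd"),   # header of page 1
]

print(preferred_occurrence(candidates))
```

In this made-up example the header on page 1 wins over both the footer on page 1 and the header on page 2.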
6. Avoid overlapping annotations
While it is occasionally fine to extract data from the same place, you should avoid overlapping annotations. Taking the same value for several fields may confuse the engine and lead to lower confidence in predicting those fields.
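One way to picture overlapping annotations is as rectangles that share some area. The sketch below flags such pairs; the field names and coordinates are invented for illustration and do not reflect how Rossum stores annotations.

```python
from itertools import combinations

# (left, top, right, bottom) in page coordinates
Rect = tuple[float, float, float, float]


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by two rectangles, or 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def overlapping_fields(annotations: dict[str, Rect]) -> list[tuple[str, str]]:
    """Return every pair of field names whose boxes share some area."""
    return [
        (name_a, name_b)
        for (name_a, rect_a), (name_b, rect_b) in combinations(annotations.items(), 2)
        if overlap_area(rect_a, rect_b) > 0
    ]


# Hypothetical field names and boxes.
annotations = {
    "invoice_id": (100, 40, 180, 55),
    "order_id": (150, 40, 230, 55),    # overlaps invoice_id
    "date_issue": (100, 70, 180, 85),  # separate line, no overlap
}

print(overlapping_fields(annotations))
```

Any pair reported here is a candidate for redrawing so that each field has its own, separate bounding box.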
7. Focus on each field and annotate the correct values
When capturing the data, pay attention to each field and annotate the correct values. Always check that the values predicted by the engine are correct.
If you find any typos or other errors, try adjusting the bounding box to get the correct value.
8. Annotate tax details together
When annotating tax data (e.g., tax rates, tax amounts, base amounts, etc.), make sure to annotate the related values together. All tax data should be in the tax tables.
Values that are in the document total table (usually, the total base amount or subtotal, total tax amount, and total amount with tax) should be captured in the corresponding header fields.
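The split between tax table rows and header totals can be illustrated with a small consistency check. It assumes that the tax table rows add up to the document totals within a rounding tolerance, which is common on invoices but not guaranteed. All field names and amounts below are hypothetical.

```python
from decimal import Decimal

# Hypothetical layout: one dict per row of the tax table.
tax_table = [
    {"rate": Decimal("21"), "base": Decimal("100.00"), "tax": Decimal("21.00")},
    {"rate": Decimal("10"), "base": Decimal("50.00"), "tax": Decimal("5.00")},
]

# Hypothetical header fields taken from the document total table.
header = {
    "total_base": Decimal("150.00"),
    "total_tax": Decimal("26.00"),
    "total_with_tax": Decimal("176.00"),
}


def totals_are_consistent(rows, totals, tolerance=Decimal("0.01")):
    """Check that tax rows sum to the header totals, within a tolerance."""
    base_sum = sum((r["base"] for r in rows), Decimal("0"))
    tax_sum = sum((r["tax"] for r in rows), Decimal("0"))
    grand_total = totals["total_base"] + totals["total_tax"]
    return (
        abs(base_sum - totals["total_base"]) <= tolerance
        and abs(tax_sum - totals["total_tax"]) <= tolerance
        and abs(grand_total - totals["total_with_tax"]) <= tolerance
    )


print(totals_are_consistent(tax_table, header))
```

A mismatch does not necessarily mean the annotation is wrong, but it is a useful prompt to recheck which values were captured in the tax table and which in the header fields.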
9. Annotate the data values
Annotate only the data values, not the labels. When annotating a PO number written as “PO. no.: AB1234”, for example, only annotate “AB1234”.
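To make the distinction between label and value concrete, here is a minimal sketch that separates the two in a piece of text. It assumes the label ends at the first colon, which holds for the example above but not for every document (a value that itself contains a colon would be cut). The function name is made up for this example.

```python
def strip_label(text: str) -> str:
    """Return the part after the first colon, or the whole text if none."""
    _label, separator, value = text.partition(":")
    return value.strip() if separator else text.strip()


print(strip_label("PO. no.: AB1234"))  # AB1234
print(strip_label("AB1234"))           # AB1234
```

The bounding box you draw should cover only what this function returns, not the label in front of it.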
10. Prefer values in a standard font and size
If the same value appears in the logo, footer, and body of the document, choose the one with the most standard font and size.
How to capture table data in Rossum
Annotate table data in the Line Items and header data in the header fields
Unless otherwise instructed by Rossum, you should annotate the header fields as header fields and line items as line items. If the invoice has only one line item, the total amounts are usually the same as the line item amounts. Even if the amount is the same, annotate the total in the table footer rather than within the line item.
Use the Magic Grid to annotate structured line item tables
The Magic Grid is a helpful tool for quickly annotating structured line item tables. These are tables with data placed in separate columns, with one data type per column, and each line item in a different row.
It is possible to annotate them by pointing and clicking, but if the table contains many values, it will be faster to use the Magic Grid. Use it for all the data you can extract from such a table.
Drag the grid over the data and adjust it as needed. You can adjust the grid by moving the separators up or down, adding or removing labels, ignoring rows you don’t want to capture, and so on.
When using the Magic Grid, remember to place the column and row separators close to the values. As mentioned in the first point, make sure not to capture too much whitespace.
You can find more information on how to use Rossum’s Magic Grid in our user guide article.
Use Magic Items for semi-structured tables
Note: the Magic Items button is not available by default. If you don’t see it, please get in touch with your Rossum representative so that we can add it for you.
Semi-structured tables contain nested values, which means that several values are in the same column or spread across multiple columns. This makes it difficult to capture data by simply adjusting the grid. In these cases, the quickest and best option is to combine the Magic Grid and Magic Items (a point-and-click approach to capturing nested values).
Always use the grid to annotate as much as possible. For the remaining data, use your mouse to capture the values in one row: point to each value and either click the b-box around it or draw a new b-box around it from scratch.
After capturing the first row’s data, click the “Extract nested values” button. Based on your input, Rossum can complete other values in the table’s columns. Furthermore, based on the feedback, the AI Engine will improve over time.
You can also watch our short video on data capture in structured and semi-structured tables.
If capturing data from such complex tables is causing you trouble, do not hesitate to contact us at email@example.com.