If you have purchased the dedicated AI engine functionality, Rossum will automatically train its AI to suit your specific needs, such as custom fields or specific document types. The training process runs in the background while you use the validation interface; however, it’s important to follow the best practices below to ensure that your dedicated AI engine achieves the highest possible accuracy.
Note: Although Rossum currently supports pre-trained data fields solely for processing invoices, the technology is document-agnostic and can extract data from any semi-structured document including, but not limited to, receipts, purchase orders, and shipping documents.
Introduction to document annotation
- Annotating a document means tagging a value in the document, either by dragging a rectangle around it or by pointing and clicking on it, which creates a rectangle with a dashed-line border. The data in the rectangle should then be extracted automatically into the corresponding field in the sidebar.
- Data that was originally extracted by Rossum and is correct should not be annotated again.
- Set up the extraction schema of the queue (sidebar configuration) and test it to ensure it contains all the fields you need before you start annotating. Do not adjust the schema during the annotation process, and do not start annotating without confirmation from the Rossum team.
- For the purpose of the engine training, only annotate documents in one queue designed for annotating, unless you have a different agreement with the Rossum team.
- During the annotation process, do not deploy any connector extension in the queue that changes the values of extracted fields. Extensions that do not influence the extracted values are allowed. Discuss your setup with Rossum if you are using custom extensions.
- Let us know after you have processed your first 30 different documents. We will review the annotations and share any feedback to ensure the captured data is high quality and in the form most suitable for Rossum’s core AI engine.
- Please note that the quality of the engine depends heavily on the quality of your annotations. Inconsistent or incorrect annotations will reduce the engine’s accuracy.
- Let us know when you have annotated the minimum required number of documents (as per your agreement with Rossum’s automation expert). We will start the dedicated AI engine training process at that point. Please allow 2-3 weeks to complete the first training iteration.
- Annotate a representative sample of your documents. If 50% of your documents come from Supplier A and 5% from Supplier B, keep approximately the same ratio among the documents you annotate. You can make an exception if you are experiencing unsatisfactory extraction results for documents from Supplier B. In such a case, the share of Supplier B documents can be somewhat higher than its usual share. In any case, try to annotate at least 15 documents per supplier.
- If a value is mentioned on more than one page of a document, annotate the value on the first page.
- If a value that needs to be annotated does not appear on the first page, annotate it on a later page of the document.
- If a value appears in multiple locations, always annotate the value in the same location. The choice of the location is yours, but we recommend you choose the location where you would intuitively look for the value yourself. Keep this in mind, especially if other users will be annotating your documents.
- An AI engine is always trained against the image. If a value is not present in a document, the engine cannot learn how to extract it. If you are adding values to some fields by typing them in manually (without specifying a location) or programmatically, such data will not be used to train the engine.
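The representative-sample guideline above can be turned into a rough annotation plan. The following is a minimal sketch and is not part of Rossum: the function name, the supplier counts, and the target size are hypothetical, and only the "at least 15 documents per supplier" figure comes from this article.

```python
def plan_annotation_sample(supplier_counts, target_total, min_per_supplier=15):
    """Suggest how many documents to annotate per supplier.

    supplier_counts: documents available per supplier (hypothetical input).
    target_total: rough number of documents you intend to annotate.
    """
    total_docs = sum(supplier_counts.values())
    plan = {}
    for supplier, count in supplier_counts.items():
        # Proportional share of the target, rounded up with integer arithmetic.
        proportional = (target_total * count + total_docs - 1) // total_docs
        # Aim for the per-supplier minimum, but never plan more than is available.
        plan[supplier] = min(count, max(proportional, min_per_supplier))
    return plan


# Hypothetical document volumes per supplier.
counts = {"Supplier A": 500, "Supplier B": 50, "Supplier C": 450}
plan = plan_annotation_sample(counts, target_total=200)
print(plan)
```

Because small suppliers are raised to the minimum, the planned total can end up slightly above the target. If a supplier's extraction results are unsatisfactory, its number can be increased by hand.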
Invoice annotation best practices
- If a value is a header field, do not annotate it in the line item table in the invoice image, even if the number is correct. This typically happens when an invoice has only one line item, so the total amount is the same as the line item amount. See an example.
- Except for the Terms field (see the image example below), mark only the data values themselves for the best AI performance. For example, when annotating the total amount, annotate 1234,55 (not 1234,55$); when annotating a code, annotate only 1234 (not Item code = 1234).
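If you review annotated values outside the validation interface, for example in an export, a simple check can flag values that include more than the data itself. This is a minimal sketch and not a Rossum feature: the patterns and the function name are hypothetical, the list of currency markers is deliberately incomplete, and the check should not be applied to the Terms field.

```python
import re

# Hypothetical patterns for text that should stay outside the rectangle.
CURRENCY = re.compile(r"[$€£¥]|USD|EUR|GBP")
LABEL = re.compile(r"^\s*[A-Za-z][A-Za-z .]*\s*[=:]")


def annotation_issues(value):
    """Return a list of reasons why a value looks over-annotated."""
    issues = []
    if CURRENCY.search(value):
        issues.append("currency symbol or code")
    if LABEL.search(value):
        issues.append("label text before the value")
    return issues


# The examples from the guideline above.
for value in ["1234,55", "1234,55$", "1234", "Item code = 1234"]:
    print(repr(value), annotation_issues(value))
```

An empty list means the value passed both checks; it does not prove that the annotation is correct.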
Line items annotation
- Tables for tax details may be annotated in the same style as header fields, by labeling the individual field values manually if needed.
- When annotating the main line-items table, use only Magic Grid; do not edit the line items manually, value by value. When using Magic Grid, drag the grid lines over the data itself. If some of your common table layouts do not fit Magic Grid, please contact Rossum to discuss them.
- When marking columns using Magic Grid, put the grid lines closer to the column or row edge rather than putting them in the middle of a large white space between two columns/rows.
- If you want the engine to ignore a row that does not belong to the data, click the X on the right side of the row. The row will turn grey.
- It is beneficial, but not required, to annotate the headers containing column names. Doing so leads to slightly higher accuracy, but it requires you to disable format checking for all your columns. To mark headers, your schema must be configured appropriately; we will discuss with you whether to proceed with this. If enabled, mark the header row by clicking on the window in the blue square on the right side of the row. The row will turn blue.
- Do not mark subsequent headers or footers unless you were asked to do so by the Rossum team. If you were asked to do so, mark these rows appropriately by choosing the correct option from the drop-down menu on the right.
Invoice annotation examples
In the following sections, you’ll see additional instructions for specific fields that may come in handy:
Invoice example: Terms
Examples of terms: NET 30 days, NET 14, 14 days. Terms can also appear within a sentence, in which case mark only the condition itself. Delivery terms do not belong here!
Invoice example: Total Amount (header field)
Only annotate total amounts where they are presented as the sum of all items. Do not mark these amounts in the line items table! Amounts for each row should be marked in the line-items table only.
Example of an incorrect annotation:
Invoice example: Tax Details
Invoice example: Line items annotation