If you have purchased Dedicated AI Engine functionality, Rossum will automatically train its AI to fit your specific needs (custom fields, specific document types, etc.). The training process runs in the background while you use the Validation interface. However, it is important to follow the best practices below to ensure that your Dedicated AI Engine achieves the highest possible accuracy.
Note: Although Rossum currently supports pre-trained fields only for Invoice processing, the technology is document-agnostic and can extract data from any semi-structured document including receipts, purchase orders, shipping documents, etc.
Introduction to document annotation
- Annotating a document means tagging a value on the document, either by dragging a rectangle around the value or by pointing and clicking on it, which creates a rectangle enclosed by a dashed line. The data in that rectangle should be automatically extracted to the relevant field in the sidebar.
- Data that were originally extracted by Rossum and are correct should not be annotated again.
- The Extraction schema of the Queue (sidebar configuration) should be set up before you start annotating. Check that your schema contains all the fields you need before you start. Do not make adjustments to the schema during the annotation process and do not start the annotation process without confirmation from the Rossum team.
- For the purpose of the model training, annotate the documents only in one queue designed for annotating, unless you have a different agreement with the Rossum team.
- During the annotation process, no connector extension changing the values of extracted fields shall be deployed in the queue. Extensions not influencing the extracted values are allowed. Discuss your setup with Rossum if you are using custom extensions.
- Let us know after processing the first 30-50 documents. We will review the annotations and share any feedback to make sure the resulting data are of high quality and in the form most suitable for Rossum’s AI Core Engine.
- Please note that the quality of the model will very much depend on the quality of your annotations. If your annotations are inconsistent or incorrect, it will affect the accuracy of the model in a negative way.
- Let us know when you have annotated the minimum required number of documents (as agreed with your Rossum automation expert). We will start the Custom Model training process at that point. Please allow 2-3 weeks to complete the first training iteration.
- Annotate a representative sample of your documents. If 50% of your documents come from Supplier A and 5% from Supplier B, keep approximately the same ratio during annotation. An exception applies if extraction on the documents from Supplier B is very unsatisfactory. In that case, the share of those documents can be a little higher than usual. In any case, try to annotate at least 15 documents per supplier.
- If a value is mentioned on multiple pages of a document, annotate the value on the first page.
- If a value needs to be annotated but does not appear on the first page, it is fine to annotate it on a later page of the document.
- If a value is printed in multiple locations, always annotate the value in the same location. The choice of the location is yours, but we suggest choosing the location where you would intuitively look for the value yourself. Keep this in mind especially if several users will be annotating your documents.
- A model is always trained against the image. If the value is not present on a document, the model cannot learn to extract it. If you are adding values to some fields by typing them in manually (without specifying a location) or programmatically, such data will not be used to train a model.
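The representative-sample guideline above (keep supplier ratios roughly as they occur, with at least 15 documents per supplier) can be sketched as a small planning script. This is only an illustration and not part of Rossum's tooling: the function name `annotation_plan`, the supplier counts, and the target of 200 documents are all hypothetical.

```python
def annotation_plan(supplier_counts, target_total, min_per_supplier=15):
    """Suggest how many documents to annotate per supplier.

    supplier_counts: hypothetical mapping of supplier name -> documents available.
    target_total: the number of documents you aim to annotate overall.
    """
    total_docs = sum(supplier_counts.values())
    plan = {}
    for supplier, count in supplier_counts.items():
        # Share of the target proportional to the supplier's real volume.
        proportional = round(target_total * count / total_docs)
        # Apply the per-supplier minimum, but never exceed what is available.
        plan[supplier] = min(count, max(min_per_supplier, proportional))
    return plan


# Hypothetical volumes: 50% Supplier A, 5% Supplier B, 45% Supplier C.
plan = annotation_plan(
    {"Supplier A": 500, "Supplier B": 50, "Supplier C": 450},
    target_total=200,
)
print(plan)
```

Because of the per-supplier minimum, the suggested total can end up slightly above the target. That is consistent with the guideline that small suppliers may be somewhat over-represented.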
Invoice annotation best practices
- If a value is a header field, do not annotate it in the line item table on the invoice image, even if the number is correct. This typically happens when an invoice has only one line item, so the total amount is the same as the line item amount. See an example.
- Except for the Terms field (see the image example below), only data values should be marked for the best AI performance. For example, when annotating the Total amount, number 1234,55 should be annotated (instead of 1234,55$); when annotating a code, only 1234 should be annotated (instead of Item code = 1234).
Line Items annotation
- Tables for Tax details may be annotated in the same style as header fields, by labeling the individual field values manually if needed.
- When annotating the main line-items table, use only Magic Grid; do not edit the line items manually value by value. When using Magic Grid, drag the grid lines over the data itself. If Magic Grid does not fit a table layout that occurs commonly in your documents, please contact Rossum to discuss it.
- When marking columns using Magic Grid, put the grid lines closer to the column or row edge rather than putting them in the middle of a large white space between two columns/rows.
- If you want the engine to ignore rows that do not belong to the data, press the X on the right side of the row. Such a row will turn grey.
- It is beneficial, but not required, to annotate the headers containing column names. It will lead to slightly higher accuracy but requires that format checking is disabled for all your columns. To mark headers, your schema must be configured appropriately; we will discuss with you whether to proceed with this. If enabled, mark the header row appropriately by clicking on the window in the blue square on the right side of the row. Such a row will turn blue.
- Do not mark subsequent headers or footers unless you were asked to do so by the Rossum team. If you were asked to do so, mark such rows appropriately by choosing the correct option from the drop-down menu on the right.
Invoice annotation examples
Below, please find some additional instructions for specific fields that may come in handy:
Invoice example: Terms
Examples: NET 30 days, NET 14, 14 days, etc. The terms can also appear within a sentence; in that case, mark only the condition itself. Delivery terms do not belong here!
Invoice example: Total Amount (header field)
Annotate the total amount only where it is written as the sum of all items. Do not mark these amounts in the line items table! Amounts for each row should be marked in the line items table only.
Example of a wrong annotation: