In this article you will learn:
- How to access and use our schema editor
- How to add a new section or edit an existing one
- How to add a new field or edit an existing one
- How to configure your schema correctly
First steps in the extraction schema setup
In the extraction schema, you define the data to be captured from the documents uploaded to Rossum. In most cases, each queue has its own schema (it is possible to link one schema to multiple queues, but this is not the default configuration).
To customize the schema associated with one of your queues, visit Queue settings. You will find the editor in the Fields to capture tab.
This tab contains a section-by-section list of the fields included in your extraction schema. For example, Basic information is a section that holds fields such as Document type, Document language, or Document ID.
On the main screen you can:
- Change the order of the fields in your schema – just drag and drop any field using the “six dots” icon.
- Decide whether the field should be visible or hidden by clicking on the “eye” icon.
- Edit a section or field.
- Remove a section or field from your schema completely by clicking on the “bin” icon.
- Add a new field.
- Add a new section.
- Open the JSON code editor – this is an alternative way to adjust your schema, for users who prefer to work directly with the code.
Adding and editing a section
A section is a container that holds fields in your schema. It represents a part of the document, such as Amounts or Vendor and Customer information. A schema should have at least one section.
When adding a new section or editing an existing one, you need to provide the following information:
- Label – the name of the section. It can be anything that helps you identify what type of information the section contains.
- ID – a unique section identifier used in exports and integrations, e.g. to map the data.
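For readers who prefer the JSON code editor, a section can be pictured as a small object carrying the two attributes listed above. The following is a minimal sketch in Python, not a definitive description of the schema format: the key names `category` and `children`, the helper `make_section`, and the ID `basic_info_section` are assumptions made for illustration, so compare them with what the JSON editor shows for your own queue.

```python
import json


def make_section(section_id: str, label: str, children=None) -> dict:
    """Build a section object with the two attributes described above."""
    if not section_id or not label:
        raise ValueError("a section needs both an ID and a label")
    return {
        "category": "section",  # assumed key and value
        "id": section_id,       # unique identifier used in exports and integrations
        "label": label,         # free-form name of the section
        "children": list(children or []),  # assumed key: the fields in this section
    }


# Hypothetical ID; the label is taken from the example in this article.
section = make_section("basic_info_section", "Basic information")
print(json.dumps(section, indent=2))
```

Keeping the ID separate from the label means the label can be reworded later without touching anything that maps data by ID.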
Adding and editing a field
When you add a new field or edit an existing one, you will be asked to provide the following information:
- Label – the name of the field that will be visible for annotators on the validation screen and in the export results (.csv and .xlsx files).
- ID – a unique field identifier used in exports and integrations, e.g. to map the data.
- Data type – determines the type of data, e.g. string, number, date, enum, multivalue field, or table. Please find more information here.
- Format – available for two data types: date and number.
- Required – if you mark the field as required, you will not be able to confirm the document until the value is added.
- Visible – you can decide whether the field should be visible or hidden for the annotators on the validation screen. If you hide the field, it will not be removed from the schema and the value can still be extracted.
- RIR field names – this attribute determines which value should be presented in a field. For example, our Accounts Payable and Receivable AI engine is already able to recognize certain fields (you can find the full list here). If you use this engine and want to capture bank account numbers, create an “Account number” field (you can use any label) and set “account_num” as the RIR field names to let the engine know what value you expect to get in this field.
Important: if you would like to create a field that our pre-trained engines don’t know (a custom field), please leave the RIR field names attribute empty. If you are preparing the schema for dedicated engine training, we will generate the RIR field names and add them for you once the engine is ready to recognize the value automatically.
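The field attributes above can be sketched as a single object per field. This is an illustrative Python sketch only: the key names (`category`, `type`, `constraints`, `hidden`, `rir_field_names`, `format`), the helper `make_field`, and the IDs `account_num` and `cost_center` are assumptions rather than something this article confirms. The “Account number” label and the “account_num” RIR field name come from the example above.

```python
def make_field(field_id, label, data_type="string", required=False,
               visible=True, rir_field_names=None, fmt=None):
    """Build a field object from the attributes described above."""
    field = {
        "category": "datapoint",                 # assumed key and value
        "id": field_id,
        "label": label,
        "type": data_type,                       # e.g. string, number, date, enum
        "constraints": {"required": required},   # assumed nesting
        "hidden": not visible,                   # assumed key: inverse of Visible
        "rir_field_names": list(rir_field_names or []),
    }
    if fmt is not None:
        # Format is available only for the date and number data types.
        if data_type not in ("date", "number"):
            raise ValueError("format applies only to date and number fields")
        field["format"] = fmt
    return field


# Predefined field: the engine is told which value to put here.
account_number = make_field("account_num", "Account number",
                            rir_field_names=["account_num"])

# Custom field (hypothetical): RIR field names stay empty.
cost_center = make_field("cost_center", "Cost center")
```

Leaving `rir_field_names` empty for the custom field mirrors the Important note: the names are added later, once an engine can recognize the value.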
Before you start
- If you are using a dedicated engine, please consult Rossum about any changes to your schema to ensure the best results of AI training.
- Please do not change the field ID if you have already annotated any documents. If you need to do so, please reach out to email@example.com. We will help you set up the schema correctly and make sure you do not lose any valuable data.
- For predefined fields (fields supported by one of our pre-trained engines), you should edit only the Label.
- If you need to modify the field ID for business reasons, RIR field names should remain unchanged (predefined).
- When you set up the schema for the first time, RIR field names of custom fields can be empty (your AI Training Specialist can add them later).
- When you add RIR field names to a custom field:
  - for header fields: it should be the same as the field ID
  - for line items: it should start with “table_column_” (e.g. “table_column_material_number”)
- If you are not using hidden fields, please remove them from the schema completely.
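The naming rules for RIR field names on custom fields lend themselves to a quick automated check before you save a schema. The sketch below is one possible way to express them in Python; the function name `check_custom_rir_names` and its parameters are hypothetical and not part of any Rossum tool.

```python
def check_custom_rir_names(field_id, rir_field_names, is_line_item):
    """Return a list of problems found in a custom field's RIR field names.

    An empty list of names is accepted, because the names can be
    added later during the first schema setup.
    """
    problems = []
    for name in rir_field_names:
        if is_line_item:
            # Line items: the name should start with "table_column_".
            if not name.startswith("table_column_"):
                problems.append(
                    f"{field_id}: '{name}' should start with 'table_column_'")
        elif name != field_id:
            # Header fields: the name should be the same as the field ID.
            problems.append(
                f"{field_id}: '{name}' should be the same as the field ID")
    return problems


# Hypothetical custom fields used only to exercise the check.
print(check_custom_rir_names("cost_center", ["cost_center"], False))
print(check_custom_rir_names("item_material_number",
                             ["table_column_material_number"], True))
print(check_custom_rir_names("cost_center", ["costcentre"], False))
```

The first two calls return an empty list; the third reports one problem because the name differs from the field ID.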
Line item fields:
- Field ID should start with “item_”, e.g. “item_code”.
- RIR field names should start with “table_column_”, e.g. “table_column_description”.
- The tuple should be set up correctly, and the RIR field names of the tuple (usually “line_items”) need to be added as well.
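To make the line item conventions concrete, the sketch below assembles a small table in Python: each column gets an ID starting with “item_” and RIR field names starting with “table_column_”, and the tuple carries “line_items”. The nesting (a multivalue wrapping a tuple wrapping the columns), the key names, the IDs `line_items` and `line_item`, and the helper `make_line_items` are assumptions for illustration, so check the result against your schema in the JSON code editor.

```python
def make_line_items(columns):
    """Build a line item table from (suffix, label) pairs."""
    children = []
    for suffix, label in columns:
        children.append({
            "category": "datapoint",                        # assumed key and value
            "id": f"item_{suffix}",                         # ID starts with "item_"
            "label": label,
            "type": "string",
            "rir_field_names": [f"table_column_{suffix}"],  # required prefix
        })
    return {
        "category": "multivalue",              # assumed wrapper for the table
        "id": "line_items",                    # hypothetical ID
        "label": "Line items",
        "children": {
            "category": "tuple",               # one tuple describes one table row
            "id": "line_item",                 # hypothetical ID
            "label": "Line item",
            "rir_field_names": ["line_items"],  # usual value per this article
            "children": children,
        },
    }


table = make_line_items([("code", "Item code"), ("description", "Description")])
column_ids = [column["id"] for column in table["children"]["children"]]
print(column_ids)
```

Deriving both the ID and the RIR field name from one suffix keeps the two prefixes consistent across columns.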
Fields filled by extension:
- If the field is filled by an extension, please keep the RIR field names attribute empty, because no prediction is needed.
- If the value is sometimes present on the document and sometimes filled by an extension, we recommend having two fields: one that is annotated (its value is taken from the document) and another for the value populated by the extension.