In this article, you will learn:
- What the Rossum Document Splitting extension is
- What the common use case configurations are
- How to set up the Document Splitting extension
- What configuration parameters are available
What is the Document Splitting extension, and how does it work
Why is the Document Splitting extension so useful? In some business operations, multiple paper documents are often scanned into a single PDF file. As a result, you may receive files in your Rossum app that each contain several shorter documents. Rossum, however, performs data extraction on a document-level basis.
Rossum’s core feature allows you to manually split files into multiple documents. But the Document Splitting function extends the product’s core powers. It enables the reliable automation of document splitting based on configurable rules, such as:
- Fixed number of pages,
- Presence of text,
- Rules specific to certain business partners, e.g., for a specific vendor, split a file based on knowledge of its content and layout.
The splitting function is triggered after the AI extracts data. It then evaluates the conditions and applies each rule whose condition is matched. You can set it up either to suggest splits for the user to review or to split the document directly.
We highly recommend working in the suggestions mode until you have confirmed the reliability of the configuration by processing multiple documents, because you won’t be able to merge the split documents back in the Rossum UI. Please note that wrong splits must be handled by merging the documents outside of Rossum.
Common use case configurations for Document Splitting
You can copy and modify the following examples for easy setup.
1. Split based on a fixed number of pages
Splitting based on a fixed number of pages is useful when files contain multiple fixed-length documents. For example, when you receive files with many one-page documents.
Example:
In this example, we know that the supplier “Milk Company” always sends us files with multiple documents, each only one page long. That is why our configuration splits every multipage file from that vendor into one-page documents. So if we receive a 3-page file from “Milk Company” in the queue, we will get three separate processed one-page documents.
{
  "configurations": [
    {
      "rule": "number_of_pages",
      "suggest": true,
      "condition": "{sender_name} == 'Milk Company'",
      "rule_params": {
        "number_of_pages": 1
      }
    }
  ]
}
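To make the effect of this rule concrete, here is a minimal Python sketch of fixed-length splitting. It is an illustration only, not the extension's actual code: the function name `split_by_page_count` is hypothetical, and we assume that a trailing group shorter than `number_of_pages` simply becomes the last document.

```python
def split_by_page_count(page_numbers, number_of_pages):
    """Group consecutive pages into documents of `number_of_pages` pages each."""
    if number_of_pages < 1:
        raise ValueError("number_of_pages must be at least 1")
    return [
        page_numbers[start:start + number_of_pages]
        for start in range(0, len(page_numbers), number_of_pages)
    ]


# A 3-page file with number_of_pages = 1 becomes three one-page documents.
documents = split_by_page_count([1, 2, 3], 1)
```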
2. Split based on the presence of text on the page
Splitting based on the text’s presence is useful when the specific text on the page marks the beginning of the new document. For example, the presence of “Page: 1” can imply the start of a new document within a file.
Example:
The following configuration will split the document based on the presence of the text “PAGE: 1”. If the function finds the text on a page, it suggests (or performs) the split between the current and previous pages. If a page contains no such text, the function treats it as part of the preceding document. That is how this rule allows you to split files holding documents of a varying number of pages.
In this example, we limit the application of the rule by condition to a specific vendor to avoid splitting all documents by this rule.
So if the ‘Rossum.ai’ vendor sends a 3-page file to the queue, with the text “PAGE: 1” on the first and third page, the function will split the document into two. And then the first document will contain pages 1 and 2, and the second will have page 3 of the original file.
{
  "configurations": [
    {
      "rule": "text_search",
      "suggest": true,
      "condition": "{sender_name} == 'Rossum.ai'",
      "rule_params": {
        "phrase": "PAGE: 1",
        "tolerance": 0
      }
    }
  ]
}
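The splitting behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name `split_by_phrase` is hypothetical, page text is passed in as plain strings, and the phrase is matched exactly (as with a tolerance of 0). The extension itself relies on the Rossum search endpoint rather than on substring matching.

```python
def split_by_phrase(pages, phrase, split_before=True):
    """Return lists of 1-based page numbers, one list per resulting document.

    `pages` is a list of page texts. A page containing `phrase` starts a new
    document (split_before=True) or ends the current one (split_before=False).
    """
    documents = []
    current = []
    for number, text in enumerate(pages, start=1):
        found = phrase in text
        if found and split_before and current:
            documents.append(current)  # close the document before this page
            current = []
        current.append(number)
        if found and not split_before:
            documents.append(current)  # close the document after this page
            current = []
    if current:
        documents.append(current)
    return documents


# The 3-page file from the example: "PAGE: 1" appears on pages 1 and 3.
pages = ["PAGE: 1 Invoice A", "PAGE: 2 Invoice A", "PAGE: 1 Invoice B"]
documents = split_by_phrase(pages, "PAGE: 1")
```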
3. Split based on the value of an extracted field
Splitting based on the value of an extracted field is useful when no text on the page marks the start of a new document. For instance, the extension can split a multipage document each time it finds an AI-predicted document_id.
Example:
Here the extension uses values extracted from the AI-predicted document_id (Invoice Number) field to suggest the splits. The rule is based on the knowledge that every page of the file contains the document_id.
Rossum usually predicts only one occurrence of each header field in the file, the most confident one. Therefore, to split the document each time a field occurs, we need to add a hidden multivalue field to the schema. This field will not be visible to the user, but it will contain the values of all occurrences of document_id in the file (see the example JSON specification of the field below).
The splitting function will suggest the splits based on changes in the values. Let’s say there is a 3-page file that contains two invoices: the first page is one invoice, and the remaining two pages are the second. In that case, the extracted invoice numbers on the pages might look like this:
- Page 1: document_id = 123
- Page 2: document_id = 345
- Page 3: document_id = 345
The function will correctly suggest the split between pages 1 and 2 because this is where the extracted values change. There will be no division between pages 2 and 3 because the value on both is the same.
Schema configuration
{
  "category": "multivalue",
  "id": "multivalue_split_document_id",
  "label": "Multi Document ID",
  "hidden": true,
  "children": {
    "rir_field_names": [
      "document_id"
    ],
    "default_value": null,
    "category": "datapoint",
    "id": "split_document_id",
    "label": "Multi Document ID",
    "hidden": true,
    "type": "string",
    "can_export": false
  },
  "min_occurrences": null,
  "max_occurrences": null,
  "default_value": null,
  "show_grid_by_default": false,
  "rir_field_names": null
}
Document Splitting configuration
{
  "configurations": [
    {
      "rule": "field_value_comparison",
      "suggest": true,
      "condition": "{sender_name} == 'Startup from Prague'",
      "rule_params": {
        "schema_id": "multivalue_split_document_id",
        "skip_empty": true
      }
    }
  ]
}
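The value-comparison logic from this example can be sketched as follows. This is a rough illustration, not the extension's implementation: the function name `split_by_value_change` is hypothetical, and we assume one extracted value per page, with `None` standing for a page where no field was detected.

```python
def split_by_value_change(page_values, skip_empty=True):
    """Return lists of 1-based page numbers, splitting where the value changes.

    `page_values` holds one extracted value per page (None if nothing was found).
    """
    documents, current, last_value = [], [], None
    for number, value in enumerate(page_values, start=1):
        if value is None:
            if skip_empty:
                current.append(number)  # attach the page to the previous pages
                continue
            if current:
                documents.append(current)
            documents.append([number])  # the empty page is its own document
            current, last_value = [], None
            continue
        if current and last_value is not None and value != last_value:
            documents.append(current)  # the value changed, start a new document
            current = []
        current.append(number)
        last_value = value
    if current:
        documents.append(current)
    return documents


# The three pages from the example above: the value changes between pages 1 and 2.
documents = split_by_value_change(["123", "345", "345"])
```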
How to set up the Document Splitting extension
Setting up the extension takes a few simple steps.
Step 1: Prepare your queues and schemas
First, you need to identify the queue(s) with the documents that require splitting. Also, if documents will be split based on the AI-predicted fields, identify the fields’ schema ids that will be used in the configuration and add multivalue fields.
Step 2: Activate Document Splitting in the Rossum Store
To enable Document Splitting, go to the Rossum application and:
- Click on the Extensions tab at the top of the app.
- Click on the Rossum store option to display all the available extensions.
- Select the “Document Splitting” extension tile.
- Click “Try extension.”

Step 3: Specify to which queue(s) you want to add this extension
Once in the “Rossum Store Extension Settings,” scroll down to “Queues” and select the queue(s) in which you want to use the Document Splitting. Please remember to save your changes once you’ve chosen the desired queues.

Step 4: Set up the Document Splitting
You can set it up using the configuration field in the UI or using the settings attribute of the hook API object. The configuration is in JSON format. You can copy one of the above examples and adjust it to your needs or create a new setup (see the description of the available parameters below).
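If you prefer the API route, the sketch below shows one way to prepare the JSON body in Python. It only builds and serializes the payload. The helper name `build_hook_settings_payload` is hypothetical, and how the body is sent to the hook object (endpoint, authentication) is not covered here, so please check the Rossum API reference for those details.

```python
import json


def build_hook_settings_payload(configurations):
    """Wrap splitting configurations in a body that targets the hook's `settings` attribute."""
    return {"settings": {"configurations": configurations}}


payload = build_hook_settings_payload([
    {
        "rule": "number_of_pages",
        "suggest": True,  # keep suggestions mode on while testing
        "condition": "{sender_name} == 'Milk Company'",
        "rule_params": {"number_of_pages": 1},
    }
])
body = json.dumps(payload, indent=2)  # JSON text ready to be sent or pasted
```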

Configuration parameters
This is a list of configurations you can specify for custom splitting logic to apply to different documents based on the defined conditions.
Attribute | Type | Required | Default value | Description |
configurations | list | true | | List with objects defining splitting behavior. |
condition | string | false | | Condition for the splitting rule based on extracted data. |
text_condition | object | false | | Condition for the splitting rule based on the presence of text in the file. |
phrases | list | true (if text_condition is used) | | List of phrases to be searched for in the document for purposes of the text_condition. |
logic | string | true (if text_condition is used) | | Logic used to combine the found/not found phrases in the document when evaluating the text_condition. |
suggest | boolean | false | false | If true, the function will suggest the splits. If false, the function will automatically split documents. |
rule | string | true | | Type of the splitting rule. |
rule_params | object | true | | Section containing additional parameters needed for the selected rule. Can contain one or more parameters. |
number_of_pages | int | true | | Number of pages after which the splitting separator should be inserted. Applicable to the number_of_pages rule type only. |
schema_id | string | true | | Multivalue header field’s schema_id used for the splitting. Applicable to the field_occurrence and field_value_comparison rule types only. |
skip_empty | boolean | false | true | If true, pages without a detected field will be “skipped” – appended to the previous pages. If false, pages without a detected field will be considered a separate document. Applicable to the field_occurrence and field_value_comparison rule types only. |
phrase | string | true | | Text that will be searched for in the document and used for splitting. |
tolerance | int | false | | Maximum edit distance between the searched phrase and text in the file. See full description here. |
split_before | bool | false | true | If true, the function will insert the splitting separator before pages where the defined text was detected. If false, it will insert the separator after those pages. |
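Before saving a configuration, it can help to check it against the required attributes from the table. The following Python sketch does that. It is a hypothetical helper, not part of the extension: `validate_settings` and the mapping of required parameters to rule types are our reading of the table, and the assumption that `phrase` belongs to the `text_search` rule comes from the example above.

```python
# Required rule_params per rule type, as inferred from the parameter table.
REQUIRED_RULE_PARAMS = {
    "number_of_pages": ["number_of_pages"],
    "field_occurrence": ["schema_id"],
    "field_value_comparison": ["schema_id"],
    "text_search": ["phrase"],
}


def validate_settings(settings):
    """Return a list of human-readable problems (empty if none were found)."""
    configurations = settings.get("configurations")
    if not isinstance(configurations, list):
        return ["'configurations' is required and must be a list"]
    errors = []
    for index, config in enumerate(configurations):
        prefix = f"configurations[{index}]"
        rule = config.get("rule")
        if rule not in REQUIRED_RULE_PARAMS:
            errors.append(f"{prefix}: unknown or missing rule {rule!r}")
            continue
        params = config.get("rule_params")
        if not isinstance(params, dict):
            errors.append(f"{prefix}: 'rule_params' is required and must be an object")
            continue
        for name in REQUIRED_RULE_PARAMS[rule]:
            if name not in params:
                errors.append(f"{prefix}: rule_params.{name} is required for {rule}")
        text_condition = config.get("text_condition")
        if text_condition is not None:
            for name in ("phrases", "logic"):
                if name not in text_condition:
                    errors.append(f"{prefix}: text_condition.{name} is required")
    return errors


errors = validate_settings({
    "configurations": [
        {"rule": "text_search", "suggest": True,
         "rule_params": {"phrase": "PAGE: 1", "tolerance": 0}}
    ]
})
```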
Types of splitting rules
- number_of_pages – Number of pages after which the function will insert the splitting separator. As a result, each created document will have precisely that specified number of pages.
- field_occurrence – Splits are done based on the occurrence of a defined field. This field identifies either the document’s first or last page. Splitting separators will always be inserted before/after pages with the detected field.
- text_search – Splits are done based on the occurrence of some text on a document. It uses the Rossum search endpoint to get the search results. Splitting separators will always be inserted before/after pages with detected text.
- field_value_comparison – The function will consider the occurrence of the fields and extracted values when creating splits/suggestions. If detected fields are on different pages but have the same value, the function will consider it one document and not split those pages.
Condition
Condition is a Python expression with schema_id placeholders that are replaced with the corresponding values before the condition is evaluated. If True, the corresponding splitting rule is applied. If False, the splitting rule is skipped.
Condition references different annotation content values by their schema_id placeholders: {schema_id}. The schema_id has to be inside curly brackets {schema_id} for the function to recognize and evaluate it properly. Different types of schema objects are handled differently:
- multivalue schema_id – Schema object with category multivalue. Multivalues are allowed only in combination with len: len({multivalue_schema_id}). It will always be evaluated as len(multivalue["children"]) – the number of rows (values) inside the multivalue.
- datapoints inside multivalue – Schema object with category datapoint that is located inside a multivalue. Such datapoints are allowed only in combination with all or any – for example, all({li_datapoint_schema_id}) or any(map(lambda x: x > 0, {li_datapoint_schema_id})). Because these datapoints can occur multiple times in the annotation content, they are replaced with a list of the corresponding values – all(["value1", "value2", "value3"]).
- simple datapoints – Simple schema object with category datapoint. Datapoint schema_ids are replaced with the corresponding value from the annotation content (string or float).
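The placeholder mechanism can be illustrated with a small Python sketch. It is a simplified stand-in under stated assumptions, not the extension's evaluator: `evaluate_condition` and the `values` dictionary are hypothetical, multivalues are represented as a list of rows, and `eval` is used here only for brevity (it should never be applied to untrusted input).

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def evaluate_condition(condition, values):
    """Replace {schema_id} placeholders with literal values, then evaluate.

    `values` maps schema_id to a string or float (simple datapoint) or to a
    list (multivalue rows, or all values of a datapoint inside a multivalue).
    """
    def substitute(match):
        return repr(values[match.group(1)])

    expression = PLACEHOLDER.sub(substitute, condition)
    # Expose only the helpers mentioned in this section to the expression.
    allowed = {"__builtins__": {}, "len": len, "all": all, "any": any, "map": map}
    return bool(eval(expression, allowed))


is_milk_company = evaluate_condition(
    "{sender_name} == 'Milk Company'", {"sender_name": "Milk Company"}
)
has_positive_amount = evaluate_condition(
    "any(map(lambda x: x > 0, {item_amount}))", {"item_amount": [0.0, 12.5]}
)
```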
Text condition
Text condition is a list of phrases that Rossum tries to find in the document. If it evaluates to True, Rossum applies the corresponding splitting rule. If False, Rossum skips the splitting rule.
The function performs a search for all phrases in the text_condition config. All search results are combined based on the defined logic:
- "logic": "any" – At least one phrase must be present in the document.
- "logic": "all" – All phrases must be present in the document.
"text_condition": {
  "phrases": ["Proof of Purchase", "Invoice"],
  "logic": "any"
}
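How the two logic values combine the search results can be sketched in Python as follows. This is an illustration only: `evaluate_text_condition` is a hypothetical name, and exact substring matching on plain text stands in for the search that Rossum performs on the document.

```python
def evaluate_text_condition(text_condition, document_text):
    """Combine per-phrase search results according to the configured logic."""
    found = [phrase in document_text for phrase in text_condition["phrases"]]
    logic = text_condition["logic"]
    if logic == "any":
        return any(found)  # at least one phrase is present
    if logic == "all":
        return all(found)  # every phrase is present
    raise ValueError(f"unsupported logic: {logic!r}")


text_condition = {"phrases": ["Proof of Purchase", "Invoice"], "logic": "any"}
applies = evaluate_text_condition(text_condition, "Invoice no. 123 for Milk Company")
```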