From this article you will learn:
- What is the Duplicate Handling extension
- What are the available configuration parameters
- What are the common use case configurations
- How to set up Duplicate Handling extension
What is the Duplicate Handling extension?
Organizations deal with a large number of incoming documents on a daily basis. Sometimes, these documents can be duplicated, leading to inefficiencies and a waste of resources.
For instance, multiple departments might process the same document, leading to redundant work. Also, in some cases, duplicates can lead to errors in data processing or decision-making.
Out of the box, without any configuration, Rossum detects duplicates by comparing the incoming file hash to previously received documents. To detect duplicates based on custom rules, Rossum provides a configurable extension. You can base the custom detection on the following:
- file hash
- extracted field values
- document attributes
Once the extension detects a duplicate, it can take actions such as:
- fill field values,
- forward the duplicate annotation to a different queue/status,
- mark the annotation as duplicate,
- show a message with custom text on the document,
- stop automation.
Please keep in mind that duplicates detected by the extension are included in the billing.
What are the available configuration parameters?
The extension is driven by a JSON configuration that lets you fine-tune when duplicate detection runs, how duplicates are matched, and what happens when a duplicate is found.
Example:
...
"trigger_events": ["annotation_status"],
"trigger_actions": ["changed"],
"statuses": ["importing.to_review", "to_review.postponed"],
"logic": [....]
...
Available parameters include the following elements:
Trigger Events
trigger_events defines when the configuration should be activated. By default, it includes “annotation_status” and “annotation_content”.
Trigger Actions
trigger_actions dictates when the configuration is triggered. Possible actions include “changed”, “initialize”, “updated”, “started”, “confirm”, “export”, and “user_update”.
Statuses
statuses is a list of annotation status pairs in the format {previous status}.{current status} or {previous status}->{current status}. They define the status transitions on which the processing logic runs.
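The two accepted formats can be illustrated with a short Python sketch. This is not the extension's own code: the helper names parse_status_pair and transition_matches are hypothetical and only show how a status pair breaks down into a previous and a current status.

```python
def parse_status_pair(pair: str) -> tuple[str, str]:
    """Split 'previous.current' or 'previous->current' into its two statuses."""
    # Check the arrow form first, then fall back to the dot form.
    for separator in ("->", "."):
        if separator in pair:
            previous, current = pair.split(separator, 1)
            return previous, current
    raise ValueError(f"Not a status pair: {pair!r}")


def transition_matches(statuses: list[str], previous: str, current: str) -> bool:
    """Return True if the (previous, current) transition is listed in statuses."""
    return (previous, current) in {parse_status_pair(p) for p in statuses}


statuses = ["importing.to_review", "to_review->postponed"]
print(transition_matches(statuses, "importing", "to_review"))  # True
print(transition_matches(statuses, "to_review", "confirmed"))  # False
```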
Logic
logic is the core of the duplicate handling extension. It is where you define the rules, scope, matching flow, timestamp, and actions that make up the duplicate detection process:
- Rules
rules are the criteria used to detect duplicates. There are three types of detection rules: field (based on extracted field values), filename (based on the filename), and relation (based on file hash).
Example:
...
"rules": [
  {
    "id": 1,
    "attribute": "field",
    "field_schema_id": "sender_name"
  },
  {
    "id": 2,
    "attribute": "filename"
  },
  {
    "id": 3,
    "attribute": "relation"
  }
]
...
- Matching Flow
matching_flow defines how to combine rules for duplicate detection. Each list element is either a single rule ID or several rule IDs joined by a logical “and”. The list elements themselves are combined with a logical “or”, so a document counts as a duplicate if any one element matches.
Example:
...
"matching_flow": ["1and2", "4"],
...
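To make the and/or semantics concrete, here is a minimal Python sketch of how such a list could be evaluated. It is an illustration, not the extension's implementation: the function is_duplicate and the idea of passing in the set of rule IDs that matched are assumptions made for this example.

```python
def is_duplicate(matching_flow: list[str], matched_rule_ids: set[int]) -> bool:
    """Evaluate a matching_flow against the IDs of the rules that matched."""
    for element in matching_flow:
        # "1and2" -> [1, 2]; a plain "4" -> [4]
        required_ids = [int(part) for part in element.split("and")]
        # Within one element, every listed rule must match (logical "and").
        if all(rule_id in matched_rule_ids for rule_id in required_ids):
            return True  # Elements are OR-ed: one satisfied element is enough.
    return False


flow = ["1and2", "4"]
print(is_duplicate(flow, {1, 2}))  # True: rules 1 and 2 both matched
print(is_duplicate(flow, {1}))     # False: rule 2 is missing and rule 4 did not match
print(is_duplicate(flow, {4}))     # True: rule 4 alone is sufficient
```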
- Scope
scope determines where the extension looks for duplicates. You have three options: “queue”, “workspace”, or “organization”. You can further specify the IDs of the chosen object and the statuses of the annotations to consider during detection.
Default setup:
{
  "object": "queue",
  "statuses": [
    "importing", "failed_import", "split", "to_review", "reviewing", "in_workflow", "confirmed",
    "rejected", "exporting", "exported", "failed_export", "postponed", "deleted", "purged"
  ]
}
Example:
...
"scope": {
"object": "queue"
},
...
- Timestamp
timestamp sets the time range in the past during which the extension searches for duplicates. You can specify the “action”, which defines the date condition, and the “timespan” in days for the chosen date condition.
Example:
...
"timestamp": {
"action": "arrived_at_after",
"timespan": 60
},
...
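As a rough illustration of what a 60-day timespan means, the Python sketch below computes the earliest arrival date that would still be searched. It assumes that “arrived_at_after” selects annotations that arrived later than the current time minus the timespan. The helper arrived_at_cutoff is hypothetical, and the extension's exact boundary handling may differ.

```python
from datetime import datetime, timedelta, timezone


def arrived_at_cutoff(timespan_days: int, now: datetime) -> datetime:
    """Return the point in time after which annotations are still considered."""
    return now - timedelta(days=timespan_days)


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
cutoff = arrived_at_cutoff(60, now)
print(cutoff.date())  # 2024-01-01
```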
- Actions
actions are a set of operations to be executed when duplicates are detected. Each action object includes:
type: This specifies the action to be performed, with options such as “fill_field”, “forward_annotation”, “mark_duplicate”, “show_message”, “stop_automation”, and “apply_label”.
fill_only_if_empty: If set to “true”, the field will be filled only if it was initially empty. The default value is “false”.
field_to_fill: This is the schema ID of the datapoint in the Rossum schema where a custom value is inserted when a duplicate is found.
value_to_fill: This is the custom value to be filled into the datapoint specified in “field_to_fill”. You can use the “%ANNOTATION_ID%” expression to populate it with a list of detected duplicates.
Example:
...
"actions": [{
"type": "show_message",
"message_type": "error",
"message": "Duplicates detected: %ANNOTATION_ID%"
}]
...
For detailed documentation regarding the configuration JSON, please see the technical documentation.
Common use case configurations for Duplicate Handling Extension
Detecting duplicate documents based on field values
The following configuration will detect incoming duplicates with matching invoice_id and sender_name fields against already processed documents in the same queue. In this case, we use "attribute": "field" to base duplicate handling on the extracted data. You can also configure the detection scope to detect duplicates across queues/workspaces. When a duplicate document is detected, an error message will be shown on the document.
{
  "configurations": [{
    "logic": [{
      "rules": [{
          "id": 1,
          "attribute": "field",
          "field_schema_id": "invoice_id"
        },
        {
          "id": 2,
          "attribute": "field",
          "field_schema_id": "sender_name"
        }
      ],
      "scope": {
        "object": "queue"
      },
      "matching_flow": ["1and2"],
      "actions": [{
        "type": "show_message",
        "message_type": "error",
        "message": "Duplicates detected: %ANNOTATION_ID%"
      }]
    }]
  }]
}
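Before saving a configuration like this, it can help to check that every rule ID used in matching_flow is actually defined under rules. The following Python sketch is a local sanity check written for this article, not a feature of the extension. The helper undefined_rule_ids is hypothetical, and the embedded JSON is the configuration from this example.

```python
import json


def undefined_rule_ids(logic_item: dict) -> set[int]:
    """Return rule IDs referenced in matching_flow but missing from rules."""
    defined = {rule["id"] for rule in logic_item.get("rules", [])}
    referenced = {
        int(part)
        for element in logic_item.get("matching_flow", [])
        for part in element.split("and")
    }
    return referenced - defined


config = json.loads("""
{
  "configurations": [{
    "logic": [{
      "rules": [
        {"id": 1, "attribute": "field", "field_schema_id": "invoice_id"},
        {"id": 2, "attribute": "field", "field_schema_id": "sender_name"}
      ],
      "scope": {"object": "queue"},
      "matching_flow": ["1and2"],
      "actions": [{
        "type": "show_message",
        "message_type": "error",
        "message": "Duplicates detected: %ANNOTATION_ID%"
      }]
    }]
  }]
}
""")

for configuration in config["configurations"]:
    for logic_item in configuration["logic"]:
        # An empty set means every referenced rule ID is defined.
        print(undefined_rule_ids(logic_item))  # set()
```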
Detecting Duplicate Documents in a Complex Setup
This setup finds duplicate documents in a specific queue (ID 12345) by comparing invoice_id, sender_name, filename, and file hash with documents that have already been processed in this queue. According to the matching_flow, a document is a duplicate if invoice_id, sender_name, and filename all match, or if the file hash matches. Only annotations in the “confirmed”, “exported”, or “deleted” status that were received in the last 60 days are considered. The extension is triggered only when the annotation status changes from “importing” to “to_review” or from “to_review” to “postponed”. When duplicates are identified, a custom datapoint with the schema ID “duplicate” is populated with a list of detected duplicates, the document is forwarded to the target queue (ID 123456) in the “postponed” status, it is flagged as a duplicate, and an error message is displayed on the document.
{
  "configurations": [
    {
      "trigger_events": ["annotation_status"],
      "trigger_actions": ["changed"],
      "statuses": ["importing.to_review", "to_review.postponed"],
      "logic": [
        {
          "matching_flow": ["1and2and3", "4"],
          "rules": [
            {
              "id": 1,
              "attribute": "field",
              "field_schema_id": "invoice_id"
            },
            {
              "id": 2,
              "attribute": "field",
              "field_schema_id": "sender_name"
            },
            {
              "id": 3,
              "attribute": "filename"
            },
            {
              "id": 4,
              "attribute": "relation"
            }
          ],
          "scope": {
            "object": "queue",
            "ids": [12345],
            "statuses": ["confirmed", "exported", "deleted"]
          },
          "timestamp": {
            "action": "arrived_at_after",
            "timespan": 60
          },
          "actions": [
            {
              "type": "fill_field",
              "field_to_fill": "duplicate",
              "value_to_fill": "Duplicate of %ANNOTATION_ID%"
            },
            {
              "type": "forward_annotation",
              "target_queue": 123456,
              "target_status": "postponed"
            },
            {
              "type": "mark_duplicate",
              "message": "Marked as duplicate"
            },
            {
              "type": "show_message",
              "message_type": "error",
              "message": "Detected {duplicate_ids | length} duplicates"
            }
          ]
        }
      ]
    }
  ]
}
How to set up Duplicate Handling extension
Setting up the extension itself takes a few simple steps.
Step 1: Activate Duplicate Handling in the Rossum Store
To enable Duplicate Handling, go to the Rossum application and:
- Click on the Extensions tab at the top of the app.
- Click on the Rossum store option to display all the available extensions.
- Select the “Duplicate Handling” extension tile.
- Click “Try extension”.
Step 2: Specify to which queue(s) you want to add this extension
In the “Rossum Store Extension Settings,” scroll down to “Queues” and select the queue(s) in which you want to use the function. Please remember to save your changes once you’ve chosen the desired queues.

Step 3: Set up the configuration
You can set it up using the configuration field in the UI or using the settings attribute of the hook API object. The configuration is in JSON format (see the interactive documentation of the available parameters here).
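For the API route, the sketch below shows one way the configuration could be written to the settings attribute of a hook from Python. Treat the details as assumptions to check against your own environment and the Rossum API reference: the base URL, the hook ID 1234, the token placeholder, the PATCH method on a /hooks/{id} endpoint, and the Bearer authorization header are illustrative. The helper build_settings_request is hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request


def build_settings_request(
    base_url: str, hook_id: int, token: str, settings: dict
) -> urllib.request.Request:
    """Build (but do not send) a request that updates a hook's settings."""
    body = json.dumps({"settings": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/hooks/{hook_id}",  # assumed endpoint layout
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


# The same kind of configuration you would paste into the UI field.
settings = {
    "configurations": [{
        "logic": [{
            "rules": [{"id": 1, "attribute": "relation"}],
            "scope": {"object": "queue"},
            "matching_flow": ["1"],
            "actions": [{"type": "mark_duplicate"}],
        }]
    }]
}

request = build_settings_request(
    "https://example.rossum.app/api/v1", 1234, "YOUR_TOKEN", settings
)
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```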
