How to set up the Address Prefilling extension from Rossum Store

To prevent fraud and duplicate processing, it is often necessary to match a document’s sender against an internal database of approved vendors. Master Data Hub is the default Rossum extension for this purpose. Rossum’s AI engines extract each address from a document as a single value, so it is often necessary to split that address into individual values such as city, street, and zip code, and to compare these values with the vendor management system database.

The Address Prefilling extension allows you to extract address data from your documents. It transforms a series of words and numbers that represent an address into values that represent the individual parts of that address. You can then map these values into the predefined fields of your schema.

At the core of this extension is the open-source NLP library called “libpostal.” This library is described in detail here. Additionally, this extension uses an API built on top of the libpostal library. The code can be found here.

Note that this extension is not meant to validate or look up addresses.

Configuration example

The sample configuration below illustrates the use of configuration parameters to control transformations and mapping.

Input address:

Royal Suites, Floor 8, 81 Manchester Road NOTTINGHAM NG49 9BL

Libpostal output:

{
  'house': 'royal suites', 
  'level': 'floor 8', 
  'house_number': '81', 
  'road': 'manchester road', 
  'city': 'nottingham', 
  'postcode': 'ng49 9bl'
}

Configuration sample:

{
  "rules": [
    {
      "queue_ids": [
        123, 124
      ],
      "disable_user_updates": true,
      "source_address_schema_id": "sender_address",
      "field_mappings": [
        {
          "target_schema_id": "sender_address_city",
          "address_parts": [
            "city"
          ],
          "case": "first_letter"
        },
        {
          "target_schema_id": "sender_address_state",
          "address_parts": [
            "state"
          ],
          "case": "first_letters"
        },
        {
          "target_schema_id": "sender_address_line_1",
          "address_parts": [
            "po_box"
          ]
        },
        {
          "target_schema_id": "sender_address_line_2",
          "address_parts": [
            "house",
            "category",
            "unit",
            "level"
          ],
          "separator": ",",
          "case": "first_letters"
        },
        {
          "target_schema_id": "sender_address_street",
          "address_parts": [
            "road"
          ],
          "case": "first_letter"
        },
        {
          "target_schema_id": "sender_address_country",
          "address_parts": [
            "country"
          ],
          "case": "first_letters"
        },
        {
          "target_schema_id": "sender_address_postal_code",
          "address_parts": [
            "postcode"
          ],
          "case": "upper"
        }
      ]
    }
  ]
}

The following is the result of applying the rule shown above (simplified to make it easier to read):

sender_address_city : "Nottingham", 
sender_address_state : "", 
sender_address_line_1: "", 
sender_address_line_2: "Royal Suites,Floor 8", 
sender_address_street: "Manchester road", 
sender_address_country: "", 
sender_address_postal_code: "NG49 9BL"

These values are then stored into the relevant schema fields of the document’s annotation data.
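The mapping logic can be sketched in a few lines of Python. This is a minimal illustration of the documented parameters (address_parts, separator, case), not the extension's actual code. The function names apply_case and apply_mapping are hypothetical, and the sketch assumes that the case transformation is applied to each address part before the parts are joined and that parts missing from the parsed address are skipped.

```python
def apply_case(text, case=None):
    """Apply one of the documented case options; lower case is the default."""
    if case == "upper":
        return text.upper()
    if case == "first_letter":
        # Upper-case only the first letter of the first word.
        return text[:1].upper() + text[1:].lower()
    if case == "first_letters":
        # Upper-case the first letter of every space-separated word.
        return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))
    return text.lower()


def apply_mapping(parsed, mapping):
    """Build the value for one field mapping from a libpostal-style dict."""
    # Assumption: address parts missing from the parsed address are skipped.
    parts = [parsed[name] for name in mapping["address_parts"] if parsed.get(name)]
    cased = [apply_case(part, mapping.get("case")) for part in parts]
    return mapping.get("separator", " ").join(cased)


parsed = {
    "house": "royal suites",
    "level": "floor 8",
    "house_number": "81",
    "road": "manchester road",
    "city": "nottingham",
    "postcode": "ng49 9bl",
}

city = apply_mapping(parsed, {"address_parts": ["city"], "case": "first_letter"})
state = apply_mapping(parsed, {"address_parts": ["state"], "case": "first_letters"})
postcode = apply_mapping(parsed, {"address_parts": ["postcode"], "case": "upper"})
print(city, repr(state), postcode)  # prints: Nottingham '' NG49 9BL
```

Because the parsed address contains no state, the state mapping yields an empty string, which matches the empty values in the result above.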

Setting up the extension

To install and set up the extension, follow these steps:

  1. Prepare your queues and schemas.
  2. Activate Address Prefilling in the Rossum Store.
  3. Specify the queue(s) where the extension is going to be used.
  4. Set up the actions and transformations.

Step 1: Prepare your queues and schemas

First, identify the queue(s) with the documents that require Address Prefilling. Next, identify the schema IDs for the fields that contain addresses and those that will be used to store the parsed address elements.

If your schema doesn’t yet contain fields for the individual address parts, create them. For information on schema editing, refer to this guide.

Sample schema fields

The following field definitions can be used to specify the usual address fields. Alternatively, here is a complete schema example, based on the Rossum default schema for EU invoices, that can be copied and pasted into the schema editor.

      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_city",
        "label": "Vendor City",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_state",
        "label": "Vendor State",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_line_1",
        "label": "Vendor Line 1",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_line_2",
        "label": "Vendor Line 2",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_street",
        "label": "Vendor Street",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_country",
        "label": "Vendor Country",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_postal_code",
        "label": "Vendor Postal Code",
        "hidden": false,
        "type": "string",
        "can_export": true
      },
      {
        "rir_field_names": [],
        "constraints": {
          "required": false
        },
        "default_value": null,
        "category": "datapoint",
        "id": "sender_address_house_number",
        "label": "Vendor House Number",
        "hidden": false,
        "type": "string",
        "can_export": true
      },

Step 2: Activate Address Prefilling in the Rossum Store

In order to activate Address Prefilling, follow these steps:

  1. From the main menu, click Extensions to access the Rossum Store page.
  2. In the Rossum Store, the Address Prefilling extension should be visible. If it is not, click View all.
  3. Click the Address Prefilling extension tile.
  4. Click Try extension.

Step 3: Specify the queue(s) where the extension is going to be used

Once in the extension settings, scroll down to Queues and select the queue(s) for which the extension should be used.


Step 4: Set up the actions and transformations

The extension is configured through the configuration field in the UI or by using the settings attribute of the hook API object. The configuration is in JSON format (see the description of the available parameters below).
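For the API route, the following is a rough sketch of how the settings attribute might be updated with a PATCH request. The base URL, hook ID, and token are placeholders, and the /hooks/{id} path and Bearer authorization header are assumptions that should be checked against the Rossum API reference. The function name build_settings_request is hypothetical, and the request is only prepared here, not sent.

```python
import json
import urllib.request


def build_settings_request(base_url, hook_id, token, settings):
    """Prepare (but do not send) a PATCH request that updates a hook's settings."""
    body = json.dumps({"settings": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/hooks/{hook_id}",  # assumed endpoint path
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


settings = {
    "rules": [{
        "queue_ids": [123, 124],
        "source_address_schema_id": "sender_address",
        "field_mappings": [
            {"target_schema_id": "sender_address_city",
             "address_parts": ["city"], "case": "first_letter"},
        ],
    }]
}

# Placeholder values; replace them with your own before sending.
request = build_settings_request(
    "https://example.rossum.app/api/v1", 1234, "YOUR_TOKEN", settings
)
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request)
```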

This configuration consists of rules that define a mapping and possible transformations (more on that later) between address parts and schema fields. Each rule has:

  • A list of queue IDs, so that a single instance of the extension can serve multiple queues
  • Source/target field mappings
  • Letter case transformation

The full list of available parameters is shown below:

Root | Param name | Mandatory | Description
(none) | rules | yes | A list of rules to be executed by the extension. The rule parameters are described in the rows below.
rules | queue_ids | no | The IDs of the queues for which the rule is enabled, provided the extension is attached to those queues in the extension configuration.
rules | source_address_schema_id | yes | The schema ID of the field containing the original address value.
rules | disable_user_updates | no | Default: false. If set to true, a user update of any field_mappings.target_schema_id triggers the rule to update all of the fields again. This essentially prevents user updates of the fields listed in field_mappings.
rules | field_mappings | yes | A list of mappings between libpostal values and annotation fields.
field_mappings | target_schema_id | yes | The schema ID into which the individual value(s) will be placed.
field_mappings | address_parts | yes | A list of address parts to be mapped into target_schema_id. Possible values are listed below.
field_mappings | separator | no | Default: " " (a single space). If address_parts contains multiple values, separator is used to concatenate them.
field_mappings | case | no | This extension does not preserve the original letter case of the address. If case is not set, lower case is used. Options: upper (all characters are converted to upper case); first_letter (the first letter of the first word is converted to upper case); first_letters (the first letter of each word is converted to upper case).
Configuration description
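The mandatory parameters and the allowed case options can be checked before the configuration is saved. The following is a minimal sketch of such a check, based only on the parameter descriptions in this article. The function name validate_config is hypothetical and the extension itself may validate its configuration differently.

```python
ALLOWED_CASES = {"upper", "first_letter", "first_letters"}


def validate_config(config):
    """Return a list of problems found in an Address Prefilling configuration."""
    rules = config.get("rules")
    if not isinstance(rules, list):
        return ["'rules' is mandatory and must be a list"]
    errors = []
    for i, rule in enumerate(rules):
        for key in ("source_address_schema_id", "field_mappings"):
            if key not in rule:
                errors.append(f"rules[{i}]: missing mandatory '{key}'")
        for j, mapping in enumerate(rule.get("field_mappings", [])):
            where = f"rules[{i}].field_mappings[{j}]"
            for key in ("target_schema_id", "address_parts"):
                if key not in mapping:
                    errors.append(f"{where}: missing mandatory '{key}'")
            case = mapping.get("case")
            if case is not None and case not in ALLOWED_CASES:
                errors.append(f"{where}: unknown case '{case}'")
    return errors


# A deliberately broken configuration: no address_parts and an unsupported case.
config = {"rules": [{
    "source_address_schema_id": "sender_address",
    "field_mappings": [
        {"target_schema_id": "sender_address_city", "case": "title"},
    ],
}]}
for problem in validate_config(config):
    print(problem)
```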

Possible address values

The possible values of address_parts as listed in the documentation of the libpostal library can be found here. These include:

  • house: a venue’s name, such as the “Brooklyn Academy of Music” and a building’s name, such as the “Empire State Building”
  • category: for category queries such as “restaurants”.
  • near: phrases like “in”, “near”, etc. used after a category phrase to help with parsing queries like “restaurants in Brooklyn”
  • house_number: usually refers to the external (street-facing) building number. This may be a compound, hyphenated number that includes a block or apartment number (as in Japan). libpostal will just call it house_number for simplicity.
  • road: Street name(s).
  • unit: An apartment, unit, office, lot, or other secondary unit designator.
  • level: Expressions indicating a floor number e.g. “3rd Floor”, “Ground Floor”, etc.
  • staircase: Numbered/lettered staircase.
  • entrance: Numbered/lettered entrance.
  • po_box: A non-physical (mail-only) box normally found at a post office.
  • postcode: Postal codes used for mail sorting.
  • suburb: An unofficial neighbourhood name such as “Harlem”, “South Bronx”, “Crown Heights.”
  • city_district: These are usually districts or boroughs in a city that are intended for some official function, e.g. “Brooklyn” or “Hackney” or “Bratislava IV.”
  • city: Any human settlement including cities, towns, villages, hamlets, localities, etc.
  • island: A named island, such as “Maui.”
  • state_district: Usually a second-level administrative division or county.
  • state: The first-level administrative division. Scotland, Northern Ireland, Wales, and England in the UK are mapped to “state” as well (convention used in OSM, GeoPlanet, etc.)
  • country_region: A region within a country that is not a part of an organised political system.
  • country: Sovereign nations and their dependent territories; anything with an ISO-3166 code.
  • world_region: Currently only used for adding “West Indies” after the country name, a pattern commonly seen in English-speaking Caribbean countries, such as “Jamaica, West Indies”.
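A misspelled address part (for example, postal_code instead of postcode) is easy to overlook in a long configuration. As a small illustration, the labels listed above can be kept in a set and compared with the address_parts used in the field mappings. The names KNOWN_ADDRESS_PARTS and unknown_address_parts are hypothetical.

```python
# The address part labels listed above.
KNOWN_ADDRESS_PARTS = {
    "house", "category", "near", "house_number", "road", "unit", "level",
    "staircase", "entrance", "po_box", "postcode", "suburb", "city_district",
    "city", "island", "state_district", "state", "country_region", "country",
    "world_region",
}


def unknown_address_parts(field_mappings):
    """Return, sorted, the address_parts values that are not known labels."""
    used = {part for m in field_mappings for part in m.get("address_parts", [])}
    return sorted(used - KNOWN_ADDRESS_PARTS)


mappings = [
    {"target_schema_id": "sender_address_postal_code", "address_parts": ["postal_code"]},
    {"target_schema_id": "sender_address_street", "address_parts": ["road"]},
]
print(unknown_address_parts(mappings))  # prints ['postal_code']
```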

This software and documentation use the following third-party open source libraries and parts of their original documentation:
libpostal
The MIT License (MIT)
Copyright (c) 2015 Openvenues
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Libpostal-rest
The MIT License (MIT)
Copyright (c) 2021 John Longanecker
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
