Rossum Elis Data Extraction API

Introduction

The Elis Data Extraction API extracts data from invoices. Users submit documents (PDF or image) for extraction by Rossum's neural networks and retrieve structured, machine-readable results asynchronously. This API provides only the raw interface to our extraction engine; refer to the Elis Document Management API for richer features such as document lifetime management, integrations, and in particular the verification user interface.

The public API endpoint for invoices in all supported languages is at https://all.rir.rossum.ai/ and all paths below are relative to this endpoint.

Anyone may sign up to use this API at https://rossum.ai/developers. Some users may be assigned other API endpoints that are specific to a combination of language and country and receive their endpoint details directly from Rossum.

Document attributes classification

The Elis Invoice Extractor currently determines three attributes of an invoice automatically:

  • Language it is written in. Supported languages:
    • Czech, Slovak, English, German
  • Currency in which it is to be paid. Supported currencies:
    • Czech koruna, Danish krone, Norwegian krone, Swedish krona, British Pound, Euro, and US Dollar.
  • Invoice type. Supported types:
    • Tax invoice, Credit note, Proforma

Extracted field types

The Elis Invoice Extractor currently automatically extracts the following fields at the all endpoint, subject to ongoing expansion.

API Authentication

Each request must be accompanied by an authentication header containing a unique per-user token. Currently, each user has a fixed assigned token. The header should look as follows (replace the example with your actual key):

Authorization: secret_key xxxxxxxxxxxxxxxxxxxxxx_YOUR_ELIS_API_KEY_xxxxxxxxxxxxxxxxxxxxxxx

To request an API key, sign up for free at https://rossum.ai/developers.
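
As a minimal sketch, the header described above could be built in Python like this. The helper name auth_headers is hypothetical and not part of the API; only the "secret_key" prefix and the header name come from the description above.

```python
def auth_headers(api_key: str) -> dict:
    """Build the Authorization header in the documented 'secret_key <token>' form.

    The function name is illustrative only; the API itself just expects the header.
    """
    if not api_key or api_key != api_key.strip():
        raise ValueError("api_key must be a non-empty token without surrounding whitespace")
    return {"Authorization": f"secret_key {api_key}"}
```

The returned dictionary can be passed as the headers argument of most Python HTTP clients.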

API Reference

The basic communication format is JSON. The API is asynchronous.

Check the examples of using the API from various popular languages:

Document Management

First, post a new document for asynchronous processing. Each analyzed document is referenced by a unique document id. Documents are currently kept for a fairly long time, but users should not rely on a document id remaining valid for more than 10 minutes after processing has finished.

Tip: web preview

Tip: You can visualize the extracted data on Rossum's homepage! We have a simple built-in JavaScript API client. Open a URL in this format:

https://rossum.ai/document/{document_id}?apikey={api_key}

Include &apiurl=https://XXX.rir.rossum.ai if you use a non-default API endpoint.
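
The preview URL format above can be assembled programmatically. The following is a small sketch; the function name preview_url is hypothetical, and it assumes the preview page accepts standard percent-encoded query parameters (the apiurl value is shown unencoded above).

```python
from urllib.parse import urlencode


def preview_url(document_id, api_key, api_url=None):
    """Build a web preview URL for a processed document.

    api_url is only needed for a non-default API endpoint.
    Assumption: the page decodes percent-encoded query values as usual.
    """
    params = {"apikey": api_key}
    if api_url is not None:
        params["apiurl"] = api_url
    return f"https://rossum.ai/document/{document_id}?{urlencode(params)}"
```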

Upload document - POST /document

Suggestions:

To get the best results from Elis, documents should be in A4 format and, in the case of scans or photos, at least 150 DPI. Elis can still extract data from other formats, but these settings give her the best performance.

Restrictions:

  • number of pages: the maximum number of pages per invoice is currently 32. Any document with more pages fails to be processed.
  • image dimensions: any image larger than 10,000 pixels on any side is not processed either.

Parameters:

  • locale (optional)
    • A hint that may help Elis correctly extract fields whose format depends on the locale. For example, the typical date format in the US is mm.dd.yyyy, whereas in Czech it is dd.mm.yyyy. A date such as 12. 6. 2018 is therefore extracted as the 12th of June 2018 when locale=cs_CZ is specified, and as the 6th of December 2018 when locale=en_US is used.
    • e.g. cs_CZ, en_US, ...
  • tables (optional)
    • boolean (true/false), default true
    • allows disabling extraction of tables (may be faster when we don't need them)
    • since 2018-11-03
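
To illustrate why the locale hint matters, the sketch below parses the same date string in day-first and month-first order using Python's standard library. It only demonstrates the ambiguity; it is not how Elis parses dates internally.

```python
from datetime import datetime

RAW = "12. 6. 2018"  # the ambiguous date string from the locale example

# Day-first reading, as is typical for Czech documents.
day_first = datetime.strptime(RAW, "%d. %m. %Y").date()

# Month-first reading, as is typical for US documents.
month_first = datetime.strptime(RAW, "%m. %d. %Y").date()

print(day_first.isoformat())    # 2018-06-12
print(month_first.isoformat())  # 2018-12-06
```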

Request:

POST /document[?locale=cs_CZ|en_US|...][&tables=true|false]

multipart/form-data ...

Response:

200 OK
{
  "status": "processing",
  "id": "{document_id}"
}

Upload a file and get back a document ID. The endpoint expects exactly one uploaded file in a form field named file, sent as multipart/form-data (the same encoding a browser uses for file uploads). In addition to PDF files, PNG and JPEG images are accepted. The content type is inferred automatically if not specified.

Example using curl (note the @-sign before the filename!):

curl -H "Authorization: secret_key YOURKEY" -X POST -F file=@invoice.pdf https://all.rir.rossum.ai/document
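
The same upload can be sketched in Python using only the standard library. The helper build_upload_request is hypothetical; it prepares the request without sending it, and it assumes the filename contains no double quotes. The endpoint, the file field name, and the locale and tables parameters are taken from the description above.

```python
import urllib.request
import uuid
from urllib.parse import urlencode

ENDPOINT = "https://all.rir.rossum.ai"


def build_upload_request(api_key, filename, data,
                         content_type="application/pdf",
                         locale=None, tables=None):
    """Prepare (but do not send) a POST /document multipart upload request."""
    boundary = uuid.uuid4().hex
    # A single form part named "file", as the endpoint expects.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    # Optional query parameters.
    query = {}
    if locale is not None:
        query["locale"] = locale
    if tables is not None:
        query["tables"] = "true" if tables else "false"
    url = f"{ENDPOINT}/document" + (f"?{urlencode(query)}" if query else "")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"secret_key {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the prepared request to urllib.request.urlopen and decoding the JSON response should yield the document id shown in the response above.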

Get Document - GET /document/{document_id}

Parameters:

  • filter (optional)
    • best - retrieve a subset of extracted fields filtered for only the high quality data; e.g., lower score fields are filtered in favor of higher score ones; the exact set of filters applied may change over time
    • all - return the complete set of extracted fields, even lower quality ones; the client has to post-process the fields appropriately
    • default: best

Request:

GET /document/{document_id}[?filter=best|all]
Content-Type: application/json

Response:

200 OK
{
  "status": "processing",
  "id": "{document_id}"
}
200 OK
{
  "status": "error",
  "message": "Cannot process input file"
}
200 OK
{
  "status": "ready",
  "message": "The analysis is complete",
  "original_pages": [
    "https://unsplash.it/595/842"
  ],
  "previews": [
    "https://unsplash.it/595/842"
  ],
  "fields": [
    {
      "name": "account_num",
      "title": "Bank Account",
      "content": "1234567890/0100",
      "value": "1234567890/0100",
      "value_type": "text",
      "bbox": [
        243,
        117,
        323,
        145
      ],
      "checks": {
        "checksum": "good"
      },
      "score": 0.9587
    },
    {
      "name": "amount_due",
      "title": "Amount Due",
      "content": "3 705,50",
      "value": "3705.50",
      "value_type": "number",
      "bbox": [
        420,
        516,
        580,
        532
      ],
      "score": 0.989
    }
  ],
  "text_lines": {
    "content": [
      [
        "ICompnany Name INVOICE",
        "Invoice Templale by Verlex42.corn E 2010 Verlex42 LLC"
      ]
    ],
    "name": "full_text",
    "title": "Rough Content"
  }
}

Processing states

Documents begin in the processing status, in which they are either queued or being extracted. Users are expected to check the document status periodically until it leaves the processing phase. The interval between status queries for a document should be at least 500 ms.

Example using curl:

curl -H "Authorization: secret_key YOURKEY" https://all.rir.rossum.ai/document/3af21605a5bb48ef79f752c8?filter=best
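
A polling loop that respects the 500 ms minimum interval might look like the sketch below. The function wait_for_document and its fetch, timeout, sleep and clock parameters are hypothetical; fetch stands for any callable that performs GET /document/{document_id} and returns the decoded JSON object. The 300-second default timeout is an arbitrary assumption, not a documented limit.

```python
import time


def wait_for_document(fetch, document_id, interval=0.5, timeout=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the document leaves the 'processing' status.

    fetch(document_id) must return the decoded JSON document object.
    The interval is clamped to the documented 500 ms minimum.
    """
    interval = max(interval, 0.5)
    deadline = clock() + timeout
    while True:
        doc = fetch(document_id)
        if doc.get("status") != "processing":
            return doc  # "ready" or "error"
        if clock() >= deadline:
            raise TimeoutError(f"document {document_id} is still processing")
        sleep(interval)
```

Injecting sleep and clock keeps the loop easy to test without real delays.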

Result

The returned document object has at least the following attributes:

  • status - processing, error or ready
  • original_pages - list of rendered page URLs (long-term lifetime of the URL is not guaranteed)
  • previews - list of visual preview URLs, one per page (long-term lifetime of the URL is not guaranteed)
  • fields - a list of data fields extracted from the document
  • text_lines - an object where the content attribute is a list of pages, each page represented as a list of lines roughly recognized in the document; this output is based on a less precise OCR recognition than we use for data capture
  • language - the language which the document was written in, one of ces (Czech), eng (English), deu (German), or other.
  • currency - the currency which the invoice is to be paid in, one of czk (Czech koruna), gbp (British Pound), eur (Euro), usd (US Dollar), or other.
  • invoice_type - the type of invoice, one of tax_invoice, credit_note, proforma, or other.

A processing invoice may or may not include the previews or (possibly incomplete) fields attributes. A ready invoice always contains both attributes. It is recommended that interactive API users progressively show as much as is available, even for processing invoices.

An error invoice always contains a message attribute (and typically nothing else). An error may be returned when we cannot process the invoice file format or when we can determine a priori that our result would be of too low quality.

Extracted fields

Each data field in the fields list is an object with these attributes:

  • name - machine-readable name of the data field; see a list of supported data fields
  • title (deprecated) - user-readable en_US name of the data field, e.g. "Variable symbol"
  • content - either a string with the extracted value of the field (as a verbatim string copied from the source invoice), or a list of further fields in case of a group field (see below)
  • value - a string containing a machine-readable parsed field value that should be interpreted based on the value_type field
  • value_type - one of the following:
    • "number": value is a numerical value, either a %d integer (e.g. for numerical identifiers) or a %d.%d fixed-point decimal (e.g. for amounts and rates)
    • "date": value is a YYYY-MM-DD ISO-format date
    • "text": value is an arbitrary string (note that this string may still differ from content as a value that is more sanitized for further processing, e.g. with stripped spaces and other characters in case of data fields with specified semantics, e.g. bank account numbers contain only digits and dashes)
  • Note that more value types may appear in the future.
  • bbox - list of coordinates of the bounding box of the field in the document, in the order x1, y1, x2, y2 (left, top, right, bottom)
  • score - confidence (estimated probability) that this field has been extracted correctly (see also below)
  • checks - dictionary of automated checks that were performed to verify the extracted value, each key associated with the check result (good, bad or other values in the future)

Group fields are used to logically group a set of values, chiefly tax details for a particular tax level.

In the filter=all mode, multiple field instances of the same type may be extracted and returned. In select cases (group fields and address lines), all instances of a field are relevant in principle. For the remaining field types, most users will want to select the one field instance with the highest score value for each field type. To get the fields already consolidated this way, pass filter=best (the default).
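
A client working with filter=all could consolidate the fields itself along these lines. This is only a sketch: the function consolidate and its multi_names parameter are hypothetical, and the caller has to supply the names of group fields and address lines, since they are not enumerated here. The actual filter=best logic on the server may differ and may change over time.

```python
def consolidate(fields, multi_names=frozenset()):
    """Keep the highest-scoring instance per field name.

    Fields whose name is in multi_names (e.g. group fields or address
    lines, as chosen by the caller) are all kept.
    """
    best = {}
    kept = []
    for field in fields:
        name = field["name"]
        if name in multi_names:
            kept.append(field)
        elif name not in best or field.get("score", 0.0) > best[name].get("score", 0.0):
            best[name] = field
    return kept + list(best.values())
```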

Confidence

The score and checks attributes are provided to allow users to fully automate the processing of some fields with high confidence. Do not assume a particular set of checks, as we are constantly expanding them. You may fully automate the processing of a field without user verification if either all check values are good, or no check value is bad and the score is above your threshold (e.g. 0.975 for a maximum 2.5% error rate). Different check values may be introduced in the future, so do not assume only good and bad; adhering to the two conditions above is future-proof.
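
The automation rule above can be sketched as a small predicate. The function name can_automate is hypothetical. One assumption is made explicit here: a field without any checks is not treated as "all checks good" and must pass the score threshold instead.

```python
def can_automate(field, threshold=0.975):
    """Decide whether a field can be processed without user verification."""
    results = list((field.get("checks") or {}).values())
    # Condition 1: at least one check was performed and all of them are good.
    # (Assumption: an empty set of checks does not count as "all good".)
    if results and all(result == "good" for result in results):
        return True
    # Condition 2: no check is bad and the score is above the threshold.
    # Unknown future check values are neither good nor bad.
    no_bad = all(result != "bad" for result in results)
    return no_bad and field.get("score", 0.0) > threshold
```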

Statuses of many documents in bulk - POST /documents/status

POST /documents/status[?details=true|false]

Get the processing status of a larger number of documents in bulk for a given user.

It's more efficient than calling /document/<job_id> on each document and can be called even with a large number of documents.

You can query only whether each document has finished, which is quite fast, or pass details=true to also learn whether a finished document succeeded or failed (slower).

The possible statuses are listed below. In addition to the processing states, there is a "not_found" state for documents that would return 404 if queried individually.

  • "processing" - waiting in queue or being processed
  • "finished" - finished (details=false)
  • "ready" - successfully finished (details=true)
  • "error" - finished with failure (details=true)
  • "not_found" - non-existing job id or job not owned by the user

Request:

POST /documents/status

{"job_ids": ["foo", "bar", "baz", "quack"]}

Response:

HTTP 200 OK

{
  "foo": "processing",
  "bar": "finished",
  "baz": "finished",
  "quack": "not_found"
}

Request:

POST /documents/status?details=true

{"job_ids": ["foo", "bar", "baz", "quack"]}

Response:

HTTP 200 OK

{
  "foo": "processing",
  "bar": "ready",
  "baz": "error",
  "quack": "not_found"
}
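
A client might build the request body and group the returned statuses as in the sketch below. Both helper names are hypothetical; the job_ids key and the response shape follow the examples above.

```python
import json
from collections import defaultdict


def status_request_body(job_ids):
    """Serialize the JSON body for POST /documents/status."""
    return json.dumps({"job_ids": list(job_ids)})


def group_by_status(response):
    """Turn {job_id: status} into {status: [job_id, ...]}."""
    groups = defaultdict(list)
    for job_id, status in response.items():
        groups[status].append(job_id)
    return dict(groups)
```

For the details=true response above, group_by_status would place foo under "processing", bar under "ready", baz under "error" and quack under "not_found".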

Send feedback - PUT /document/{doc_id}/feedback

Request:

PUT /document/{doc_id}/feedback
Content-Type: application/json

{
  "result": "incorrect",
  "fields": [
    "sender_ic",
    "recipient_ic"
  ]
}

Response:

200 OK
Content-Type: application/json

{}

Notify us of a successful or unsuccessful extraction based on human validation. In case of a mistake, pass the names of the fields that were extracted incorrectly, were superfluous, or were missing.

Body of the request shall be an application/json object with these attributes:

  • result - one of correct or incorrect
  • fields (if result is incorrect) - list of field names

Note: This is an experimental endpoint. We currently consider feedback only from selected customers; please contact support@rossum.ai for details.
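
The feedback request body could be assembled as in this sketch. The helper name feedback_body is hypothetical; the result and fields attributes follow the description above.

```python
import json


def feedback_body(correct, wrong_fields=()):
    """Serialize the JSON body for PUT /document/{doc_id}/feedback."""
    if correct:
        if wrong_fields:
            raise ValueError("wrong_fields only applies to incorrect results")
        return json.dumps({"result": "correct"})
    # For an incorrect result, list the names of the affected fields.
    return json.dumps({"result": "incorrect", "fields": list(wrong_fields)})
```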