Rossum Data Extraction API


The Rossum Data Extraction API is a low-level API for invoice data extraction.

This API is deprecated for general usage. Please refer to the Rossum Document Management API which is preferred for new applications built around Rossum's platform.

This API provides only the raw interface to our extraction engine: it lets users submit documents (PDF or image) for data extraction by Rossum's neural networks and retrieve structured, machine-readable results asynchronously.

The public API endpoint for invoices in all supported languages is at and all paths below are relative to this endpoint.

Note that this API as well as the cognitive data capture capabilities of Rossum are constantly evolving. You may subscribe to the elis-api-announcements group for API update notifications.

Document attributes classification

The Rossum Invoice Extractor currently determines a multitude of invoice attributes automatically (e.g. language or currency).

Extracted field types

The Rossum Invoice Extractor currently automatically extracts the following fields at the all endpoint, subject to ongoing expansion.

Extracted tables

The Rossum Invoice Extractor currently automatically extracts line item table content and recognizes row and column types as detailed below.

API Authentication

Each request must be accompanied by an authentication header containing a unique per-user token. Currently, each user has a fixed assigned token. The header should look as follows (replace the example with your actual key):

Authorization: secret_key xxxxxxxxxxxxxxxxxxxxxx_YOUR_DE_API_KEY_xxxxxxxxxxxxxxxxxxxxxxx
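
As a minimal sketch, the header above could be built in Python as follows. The helper name auth_headers is an assumption made for illustration and is not part of the API.

```python
def auth_headers(api_key: str) -> dict:
    """Build the Authorization header in the format shown above.

    auth_headers is a hypothetical helper name, not part of the API.
    """
    if not api_key or api_key.strip() != api_key:
        raise ValueError("API key must be non-empty and free of surrounding whitespace")
    # The scheme keyword is "secret_key", followed by a space and the token.
    return {"Authorization": f"secret_key {api_key}"}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.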

To request an API key, sign up for free at

API Reference

The basic communication format is JSON. The API is asynchronous.

Check the examples of using the API from various popular languages:

Document Management

First, post a new document for asynchronous processing. Each analyzed document is referenced by a unique document id. Documents are currently kept for a rather long time, but users should not rely on a document id remaining valid for more than 10 minutes after processing has finished.

Tip: web preview

You can visualize the extracted data on Rossum's homepage, which has a simple built-in JavaScript API client. Open a URL in this format: {document_id}?apikey={api_key}

Include &apiurl= if you use a non-default API endpoint.

Upload document - POST /document


To get the best results from Rossum, documents should be in A4 format and, in the case of scans or photos, at least 150 DPI. Rossum can extract data from other formats as well, but it performs best under these conditions.


  • number of pages: the current maximum number of pages per invoice is 16. Any document with more pages fails to be processed.
  • image dimensions: any image larger than 10,000 pixels on any side is not processed either.
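
A client may want to check these limits before uploading. The sketch below assumes you already know the page count and the pixel dimensions of the file (how you obtain them is up to your PDF or image library); the function and constant names are hypothetical.

```python
MAX_PAGES = 16             # page limit stated above
MAX_IMAGE_SIDE_PX = 10_000  # per-side pixel limit stated above


def limit_violations(page_count=None, image_size=None):
    """Return a list of limit violations; an empty list means the upload looks fine.

    page_count: number of pages, if known.
    image_size: (width, height) in pixels, if the upload is an image.
    """
    problems = []
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"{page_count} pages exceeds the limit of {MAX_PAGES}")
    if image_size is not None and max(image_size) > MAX_IMAGE_SIDE_PX:
        problems.append(f"image side of {max(image_size)} px exceeds {MAX_IMAGE_SIDE_PX} px")
    return problems
```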


  • locale (optional)
    • A hint that may help Rossum correctly extract fields whose interpretation depends on the locale. For example, the typical date format in the US is mm.dd.yyyy, whilst in Czech it is dd.mm.yyyy. So a date such as 12. 6. 2018 is extracted as the 12th of June 2018 when locale=cs_CZ is specified, and as the 6th of December 2018 when locale=en_US is used.
    • e.g. cs_CZ, en_US, ...
  • tables (optional)
    • boolean (true/false), default true
    • allows disabling extraction of tables (may be faster when we don't need them)
    • since 2018-11-03
  • rotation_deg (optional)
    • one of {0, 90, 180, 270}
    • specifies the angle by which all the pages within the uploaded document are rotated in a clockwise direction before processing
  • text_lines (optional)
    • boolean (true/false), default false
    • if true the extraction results will contain text_lines (see results description)
  • effective_page_count (optional)
    • int (positive integer), defaults to the number of pages in document
    • limits the extraction to the first effective_page_count pages of the document
    • if higher than the actual page count, the value is clipped to the actual page count
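
To illustrate why the locale hint matters, the following sketch parses the same date string under two conventions. It only demonstrates the ambiguity; it is not how Rossum parses dates internally, and the format table is an assumption made for this example.

```python
from datetime import date, datetime

# Assumed day/month ordering per locale, for illustration only.
DATE_FORMATS = {
    "cs_CZ": "%d.%m.%Y",  # day first
    "en_US": "%m.%d.%Y",  # month first
}


def parse_with_locale(text: str, locale: str) -> date:
    """Parse a dotted date string according to the locale's field order."""
    compact = text.replace(" ", "")  # "12. 6. 2018" -> "12.6.2018"
    return datetime.strptime(compact, DATE_FORMATS[locale]).date()


print(parse_with_locale("12. 6. 2018", "cs_CZ"))  # 2018-06-12
print(parse_with_locale("12. 6. 2018", "en_US"))  # 2018-12-06
```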


POST /document[?locale=cs_CZ|en_US|...][&tables=true|false][&rotation_deg=0|90|180|270][&text_lines=true|false][&effective_page_count=1|2|...]

multipart/form-data ...


200 OK
{
  "status": "processing",
  "id": "{document_id}"
}

Upload a file and get back a document ID. The request must contain exactly one uploaded file in a form field named file, sent as multipart/form-data (the same way a browser submits a file). In addition to PDF files, PNG and JPEG images are supported. The content type is inferred automatically if not specified.

Example using curl (note the @-sign before the filename!):

curl -H "Authorization: secret_key YOURKEY" -X POST -F file=@invoice.pdf
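
The optional upload parameters can be assembled into the request path programmatically. This is a minimal sketch using only the parameters described in this section; the function name upload_path is hypothetical, and the result is relative to the API endpoint.

```python
from urllib.parse import urlencode


def upload_path(locale=None, tables=None, rotation_deg=None,
                text_lines=None, effective_page_count=None) -> str:
    """Build the relative POST path for an upload, omitting unset parameters."""
    params = {}
    if locale is not None:
        params["locale"] = locale
    if tables is not None:
        params["tables"] = "true" if tables else "false"
    if rotation_deg is not None:
        if rotation_deg not in (0, 90, 180, 270):
            raise ValueError("rotation_deg must be one of 0, 90, 180, 270")
        params["rotation_deg"] = str(rotation_deg)
    if text_lines is not None:
        params["text_lines"] = "true" if text_lines else "false"
    if effective_page_count is not None:
        if effective_page_count < 1:
            raise ValueError("effective_page_count must be a positive integer")
        params["effective_page_count"] = str(effective_page_count)
    query = urlencode(params)
    return "/document" + ("?" + query if query else "")
```

For example, upload_path(locale="en_US", tables=False) yields /document?locale=en_US&tables=false.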

Get Document - GET /document/{document_id}


  • filter (optional)
    • best - retrieve a subset of extracted fields filtered for only the high quality data; e.g., lower score fields are filtered in favor of higher score ones; the exact set of filters applied may change over time
    • all - return the complete set of extracted fields, even lower quality ones; the client has to post-process the fields appropriately
    • default: best


GET /document/{document_id}[?filter=best|all]
Content-Type: application/json


200 OK
{
  "status": "processing",
  "id": "{document_id}"
}

200 OK
{
  "status": "error",
  "message": "Cannot process input file"
}

200 OK
{
  "status": "ready",
  "message": "The analysis is complete",
  "original_pages": [ ... ],
  "previews": [ ... ],
  "fields": [
    {
      "name": "account_num",
      "title": "Bank Account",
      "content": "1234567890/0100",
      "value": "1234567890/0100",
      "value_type": "text",
      "page": 0,
      "bbox": [ 243, 117, 323, 145 ],
      "checks": {
        "checksum": "good"
      },
      "score": 0.9587
    },
    {
      "name": "amount_due",
      "title": "Amount Due",
      "content": "3 705,50",
      "value": "3705.50",
      "value_type": "number",
      "page": 0,
      "bbox": [ 420, 516, 580, 532 ],
      "score": 0.989
    }
  ],
  "tables": [ ... ],
  "text_lines": {
    "content": [
      [
        "ICompnany Name INVOICE",
        "Invoice Templale by Verlex42.corn E 2010 Verlex42 LLC"
      ]
    ],
    "name": "full_text",
    "title": "Rough Content"
  }
}

Processing states

Documents begin in the processing status, in which they are either queued or being extracted. Users are expected to check the document status periodically until it leaves the processing state. The interval between status queries for a document should be at least 500 ms.

Example using curl:

curl -H "Authorization: secret_key YOURKEY"
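
The polling described above can be sketched as a small loop. The fetch_document argument is a hypothetical callable that performs the GET request and returns the parsed JSON object; it is injected so the loop works with any HTTP client. The timeout is an assumption of this sketch, not an API requirement.

```python
import time

MIN_POLL_INTERVAL_S = 0.5  # minimum interval between status queries


def wait_until_done(fetch_document, interval_s=MIN_POLL_INTERVAL_S,
                    timeout_s=300.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the document leaves the "processing" status.

    Returns the final document object (status "ready" or "error").
    Raises TimeoutError if the document is still processing after timeout_s.
    """
    interval_s = max(interval_s, MIN_POLL_INTERVAL_S)  # never poll faster than allowed
    deadline = clock() + timeout_s
    while True:
        document = fetch_document()
        if document.get("status") != "processing":
            return document
        if clock() >= deadline:
            raise TimeoutError("document is still processing")
        sleep(interval_s)
```

The caller still has to inspect the returned status, since an error document also ends the loop.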


The returned document object has at least the following attributes:

  • status - processing, error or ready
  • original_pages - list of rendered page URLs (long-term lifetime of the URL is not guaranteed)
  • previews - list of visual preview URLs, one per page (long-term lifetime of the URL is not guaranteed)
  • fields - a list of data fields extracted from the document
  • attributes - list of attributes extracted from the document, see: Classification Attributes
  • tables - a list of tables extracted from the document - to learn more about the format, see: Table extraction
  • text_lines - present only when text_lines=true during upload; an object where the content attribute is a list of pages, each page represented as a list of lines roughly recognized in the document; this output is based on a less precise OCR recognition than we use for data capture
  • language - DEPRECATED (use attributes instead), see: Legacy Format
  • currency - DEPRECATED (use attributes instead), see: Legacy Format
  • invoice_type - DEPRECATED (use attributes instead), see: Legacy Format
  • payment_state - DEPRECATED (use attributes instead), see: Legacy Format
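
As an illustration of the text_lines structure described above (a list of pages, each a list of lines), the following sketch joins it into a single string. The function name is hypothetical, and the separator choice (a blank line between pages) is an assumption.

```python
def rough_text(document: dict) -> str:
    """Join text_lines into one string, with a blank line between pages.

    Returns an empty string when text_lines is absent, i.e. when the
    document was uploaded without text_lines=true.
    """
    text_lines = document.get("text_lines")
    if not text_lines:
        return ""
    pages = text_lines.get("content", [])
    return "\n\n".join("\n".join(lines) for lines in pages)
```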

A processing invoice may or may not include the previews attribute or a (possibly incomplete) fields attribute. A ready invoice always contains both attributes. Interactive API users are encouraged to progressively show as much as is available, even for processing invoices.

An error invoice always contains a message attribute (and typically nothing else). An error may be returned when we cannot process the invoice file format or when we can determine a priori that our result would be of too low quality.

Extracted fields

Each data field in the fields list is an object with these attributes:

  • name - machine-readable name of the data field; see a list of supported data fields
  • title (deprecated) - user-readable en_US name of the data field, e.g. "Variable symbol"
  • content - either a string with the extracted value of the field (as a verbatim string copied from the source invoice), or a list of further fields in case of a group field (see below)
  • value - a string containing a machine-readable parsed field value that should be interpreted based on the value_type field
  • value_type - one of the following:
    • "number": value is a numerical value, either a %d integer (e.g. for numerical identifiers) or a %d.%d fixed-point decimal (e.g. for amounts and rates)
    • "date": value is a YYYY-MM-DD ISO-format date
    • "text": value is an arbitrary string (note that this string may still differ from content as a value that is more sanitized for further processing, e.g. with stripped spaces and other characters in case of data fields with specified semantics, e.g. bank account numbers contain only digits and dashes)
  • Note that more value types may appear in the future.
  • page - an integer denoting the page on which the field appears in the document, starts at 0
  • bbox - list of coordinates of the bounding box of the field in the document, in the order x1, y1, x2, y2 (left, top, right, bottom)
  • score - confidence (estimated probability) that this field has been extracted correctly (see also below)
  • checks - dictionary of automated checks that were performed to verify the extracted value, each key associated with the check result (good, bad or other values in the future)

Group fields are used to logically group a set of values, chiefly tax details for a particular tax level.

In the filter=all mode, multiple field instances of the same type may be extracted and returned. In select cases (group fields and address lines), all instances of a field are relevant in principle. For the remaining field types, most users will want to select the one field instance with the highest score value for each field type. To get the fields already consolidated this way, pass filter=best (the default).
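
If you do use filter=all, the consolidation described above might look like the following sketch. It is not the server's filter=best logic, whose exact filters may change over time. Group fields are detected by their list-valued content; the names of other multi-instance fields (such as address lines) must be supplied by the caller, because this sketch does not assume specific field names.

```python
def consolidate(fields, multi_valued=frozenset()):
    """Keep the highest-scoring instance per field name.

    fields: the "fields" list from a filter=all response.
    multi_valued: field names (caller-supplied) whose instances are all kept.
    """
    best = {}  # field name -> best-scoring instance seen so far
    kept = []  # group fields and multi-valued fields, kept in full
    for field in fields:
        name = field["name"]
        if name in multi_valued or isinstance(field.get("content"), list):
            kept.append(field)
        elif name not in best or field.get("score", 0.0) > best[name].get("score", 0.0):
            best[name] = field
    return list(best.values()) + kept
```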


The score and checks attributes are provided to allow users to fully automate the processing of fields extracted with high confidence. Do not assume a particular set of checks, as we are constantly expanding them. You may fully automate the processing of a field without user verification if all check values are good, or if no check value is bad and score is above your threshold (e.g. 0.975 for a maximum 2.5% error rate). Note that different check values may be introduced in the future, so do not assume just good and bad; adhering to the two conditions above is, however, future-proof.
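
The two conditions above can be expressed as a small predicate. One interpretation in this sketch is an assumption: a field with no checks at all is not treated as "all checks good" and must therefore pass the score threshold.

```python
def can_automate(field: dict, threshold: float = 0.975) -> bool:
    """Decide whether a field can be processed without user verification."""
    results = list((field.get("checks") or {}).values())
    # Condition 1: at least one check was performed and every check is "good".
    if results and all(result == "good" for result in results):
        return True
    # Condition 2: no check is "bad" (unknown future values are tolerated)
    # and the score is above the caller's threshold.
    no_bad = all(result != "bad" for result in results)
    return no_bad and field.get("score", 0.0) > threshold
```

Comparing against "bad" rather than requiring "good" in the second condition is what keeps the predicate stable if new check values appear.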

Statuses of many documents in bulk - POST /documents/status

POST /documents/status[?details=true|false]

Get the processing status of a larger number of documents in bulk for the given user.

It's more efficient than calling /document/<job_id> on each document and can be called even with a large number of documents.

You can query only whether each document has finished, which is quite fast, or pass details=true to also learn whether a finished document succeeded or failed (slower).

The possible statuses are listed below. In addition to the processing states, there is a "not_found" status for documents that would return 404 if queried individually.

  • "processing" - waiting in queue or being processed
  • "finished" - finished (details=false)
  • "ready" - successfully finished (details=true)
  • "error" - finished with failure (details=true)
  • "not_found" - non-existing job id or job not owned by the user
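
Since the response maps each job id to one of these statuses, a client will often want to build the request body and then group the ids by status. The following is a minimal sketch; both function names are hypothetical.

```python
import json
from collections import defaultdict


def status_request_body(job_ids) -> str:
    """Serialize the JSON body expected by POST /documents/status."""
    return json.dumps({"job_ids": list(job_ids)})


def group_by_status(response: dict) -> dict:
    """Invert a {job_id: status} response into {status: [job_id, ...]}."""
    groups = defaultdict(list)
    for job_id, status in response.items():
        groups[status].append(job_id)
    return dict(groups)
```

Ids grouped under "processing" are the ones to query again later.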


POST /documents/status

{"job_ids": ["foo", "bar", "baz", "quack"]}


{
  "foo": "processing",
  "bar": "finished",
  "baz": "finished",
  "quack": "not_found"
}


POST /documents/status?details=true

{"job_ids": ["foo", "bar", "baz", "quack"]}


{
  "foo": "processing",
  "bar": "ready",
  "baz": "error",
  "quack": "not_found"
}

Send feedback - PUT /document/{doc_id}/feedback


PUT /document/{doc_id}/feedback
Content-Type: application/json

{
  "result": "incorrect",
  "fields": [ ... ]
}


200 OK
Content-Type: application/json


Notify us of a successful or unsuccessful extraction based on human validation. In case of a mistake, pass the names of the fields that were extracted incorrectly, were superfluous, or were missing.

Body of the request shall be an application/json object with these attributes:

  • result - one of correct or incorrect
  • fields (if result is incorrect) - list of field names
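
A request body with these attributes can be produced as in the sketch below. The function name is hypothetical; passing no field names yields a correct result, and passing any yields an incorrect result listing them.

```python
import json


def feedback_body(incorrect_fields=()) -> str:
    """Serialize the feedback JSON body from a list of wrong field names."""
    names = list(incorrect_fields)
    if names:
        return json.dumps({"result": "incorrect", "fields": names})
    # The fields attribute is only sent when the result is incorrect.
    return json.dumps({"result": "correct"})
```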

Note: This is an experimental endpoint. We currently consider feedback only from selected customers; please contact us for details.