Table Extraction

The Elis Invoice Extractor automatically extracts the content of the line item table and recognizes row and column types, as detailed below.

Supported Table Formats

Invoice line items come in a wide variety of shapes and forms. The current implementation can handle (or learn) most layouts: with or without borders, with different spacings, header rows, etc. We currently make two further assumptions:

  • The table generally follows a grid structure; that is, columns and rows can be represented as rectangular spans. In practice, this means that we may currently cut off text that overflows from one cell into the next column. We also do not optimize for table rows that are wrapped onto multiple physical lines.

  • The table contains just a flat structure of line items, without subsection headers, nested tables, etc.

We plan to gradually remove both assumptions in the future.

Output Format

The JSON response contains the key tables, a list of objects describing every table extracted from the document. If no table is found, the list is empty. The table object structure can be derived from the following example:

{
  "column_types": ["table_column_quantity", "table_column_description", "table_column_amount", "table_column_amount_total"],
  "rows": [
    {
      "cells": [
        {"content": "QUANTITY", "bbox": [90, 756, 240, 816]},
        {"content": "DESCRIPTION", "bbox": [330, 756, 534, 816]},
        {"content": "UNIT PRICE", "bbox": [696, 756, 870, 816]},
        {"content": "AMOUNT", "bbox": [936, 756, 1068, 816]}
      ],
      "type": "header"
    },
    {
      "cells": [
        {"content": "150", "bbox": [90, 882, 240, 918]},
        {"content": "Item 1", "bbox": [330, 882, 534, 918]},
        {"content": "£15", "bbox": [696, 882, 870, 918]},
        {"content": "£2250", "bbox": [936, 882, 1068, 918]}
      ],
      "type": "data"
    },
    {
      "cells": [
        ...
      ],
      "type": "data"
    },
    ...
  ],
  "page": 0,
  "bbox": [90, 756, 1068, 1068]
}

where the individual attributes of a table object are:

  • column_types: a list containing identifiers (see next section) describing the informational content of the corresponding table's columns. The number of identifiers is therefore the same as the number of extracted columns.
  • rows: a list of objects describing individual rows of the extracted table.
  • page: an integer denoting the page on which the table appears in the document, counted from 0. Line items spanning multiple pages are split into multiple per-page table objects.
  • bbox: a bounding box describing the position of the table within the page.
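As a minimal sketch of reading these attributes, the following Python snippet loads a response and summarizes each table object in page order. The summarize_tables helper and the inline sample response are hypothetical and only mirror the structure documented above; the lookup with a default is a defensive assumption, since the tables key is expected to be present.

```python
import json


def summarize_tables(response_text):
    """Return (page, bbox, column count) for each table, ordered by page."""
    response = json.loads(response_text)
    # An empty list means that no table was found in the document.
    tables = response.get("tables", [])
    ordered = sorted(tables, key=lambda t: t["page"])
    return [(t["page"], t["bbox"], len(t["column_types"])) for t in ordered]


# Hypothetical sample response: one line item listing split over two pages.
sample = json.dumps({
    "tables": [
        {"column_types": ["table_column_description", "table_column_amount_total"],
         "rows": [], "page": 1, "bbox": [90, 100, 1068, 400]},
        {"column_types": ["table_column_description", "table_column_amount_total"],
         "rows": [], "page": 0, "bbox": [90, 756, 1068, 1068]},
    ]
})

for page, bbox, n_columns in summarize_tables(sample):
    print(f"page {page}: {n_columns} columns, bbox {bbox}")
```

Sorting by page is a convenient way to process the per-page table objects of a multi-page listing in reading order.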

A table row is described by an object with 2 attributes:

  • cells: a list of objects describing each cell within the corresponding row.
  • type: either header or data; a header row is assumed to contain the column legend, and a data row the actual content.
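The row structure lends itself to a simple mapping from column types to cell contents. The sketch below is one possible way to do this, not a prescribed client: data_rows_as_records is a hypothetical helper, and it assumes that every row has one cell per entry in column_types and that no column type repeats within a table (a repeated type would overwrite the earlier value in the dictionary).

```python
def data_rows_as_records(table):
    """Map each data row to a dict keyed by column type; skip header rows."""
    records = []
    for row in table["rows"]:
        if row["type"] != "data":
            continue
        contents = [cell["content"] for cell in row["cells"]]
        records.append(dict(zip(table["column_types"], contents)))
    return records


# Table object taken from the example above, limited to its first two rows.
table = {
    "column_types": ["table_column_quantity", "table_column_description",
                     "table_column_amount", "table_column_amount_total"],
    "rows": [
        {"cells": [
            {"content": "QUANTITY", "bbox": [90, 756, 240, 816]},
            {"content": "DESCRIPTION", "bbox": [330, 756, 534, 816]},
            {"content": "UNIT PRICE", "bbox": [696, 756, 870, 816]},
            {"content": "AMOUNT", "bbox": [936, 756, 1068, 816]},
        ], "type": "header"},
        {"cells": [
            {"content": "150", "bbox": [90, 882, 240, 918]},
            {"content": "Item 1", "bbox": [330, 882, 534, 918]},
            {"content": "£15", "bbox": [696, 882, 870, 918]},
            {"content": "£2250", "bbox": [936, 882, 1068, 918]},
        ], "type": "data"},
    ],
    "page": 0,
    "bbox": [90, 756, 1068, 1068],
}

print(data_rows_as_records(table))
```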

Finally, each cell is described by an object with the following attributes:

  • bbox: a bounding box describing the position of the cell within the page of the document.
  • content: textual content of the cell, verbatim as written, with no (post-)processing applied to it.
  • value: a string containing the machine-readable parsed cell value, to be interpreted according to the value_type field.
  • value_type: one of the following:
    • "number": value is a numerical value, either a %d integer (e.g. for numerical identifiers) or a %d.%d fixed-point decimal (e.g. for amounts and rates).
    • "text": value is an arbitrary string (note that this string may still differ from content, as value is more sanitized for further processing, e.g. with stripped spaces and other characters).
    • Note that more value types may appear in the future.
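A client might interpret value according to value_type along the following lines. This is a sketch under assumptions: parse_cell is a hypothetical helper, the sample cells (including their value strings) are made up for illustration, and falling back to content for unknown or missing value types is just one reasonable way to stay compatible with value types added later.

```python
from decimal import Decimal


def parse_cell(cell):
    """Return a Python value for a cell based on its value_type."""
    value_type = cell.get("value_type")
    value = cell.get("value")
    if value_type == "number" and value is not None:
        # Decimal parses both "%d" and "%d.%d" strings without float rounding.
        return Decimal(value)
    if value_type == "text" and value is not None:
        return value
    # Unknown or missing value type: fall back to the verbatim content.
    return cell["content"]


# Hypothetical cells for illustration only.
amount_cell = {"content": "£2250", "bbox": [936, 882, 1068, 918],
               "value": "2250", "value_type": "number"}
text_cell = {"content": " Item 1 ", "bbox": [330, 882, 534, 918],
             "value": "Item 1", "value_type": "text"}

print(parse_cell(amount_cell), parse_cell(text_cell))
```

Decimal is used instead of float so that monetary amounts keep their exact fixed-point representation.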

Column Types Description

  • table_column_code - Item identifier
    • Can be the SKU, EAN, a custom code (string of letters/numbers) or even just the line number.
  • table_column_description - Item description
    • Textual description of the line item.
  • table_column_quantity - Quantity
    • The number of particular instances of the item. It is usually accompanied by a unit of measure (see table_column_uom).
  • table_column_uom - Unit of measure
    • The unit in which the quantity of the item is measured.
  • table_column_rate - Tax rate
    • The percentage tax rate applied to the item(s).
  • table_column_tax - Total tax
    • The amount of money to be paid as tax for the item(s).
    • Rule of thumb: tax = rate * amount_base.
  • table_column_amount - Unit price (tax included)
    • The amount to be paid for one item including the tax.
    • Rule of thumb: amount = amount_base + tax.
  • table_column_amount_total - Total amount (tax included)
    • The total amount to be paid for all the items including the tax.
    • Rule of thumb: amount_total = amount * quantity.
  • table_column_amount_base - Unit price (tax excluded)
    • The amount to be paid for one item excluding the tax.
  • table_column_amount_total_base - Total amount (tax excluded)
    • The total amount to be paid for all the items excluding the tax.
    • Rule of thumb: amount_total_base = amount_base * quantity.
  • table_column_other - Unrecognized data type
    • Returned when none of these types fits the column content well (in the algorithm's opinion).

Note that the list of column types may expand over time.
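The rules of thumb above can serve as a sanity check on extracted line items. The sketch below is illustrative only: check_line_item, the short key names (column types without the table_column_ prefix), and the 0.01 tolerance are assumptions. The tax rule is left out because the sketch does not assume whether rate is expressed as a percentage or as a fraction.

```python
from decimal import Decimal


def check_line_item(item, tolerance=Decimal("0.01")):
    """Return the names of rule-of-thumb identities that the item violates.

    `item` maps short column names to Decimal values. A rule is skipped
    when any of the columns it needs is missing from the item.
    """
    rules = {
        "amount": lambda i: i["amount_base"] + i["tax"],
        "amount_total": lambda i: i["amount"] * i["quantity"],
        "amount_total_base": lambda i: i["amount_base"] * i["quantity"],
    }
    violations = []
    for name, expected in rules.items():
        try:
            difference = abs(item[name] - expected(item))
        except KeyError:
            continue  # not enough columns to evaluate this rule
        if difference > tolerance:
            violations.append(name)
    return violations


# Values from the example data row: 150 items at 15 each, 2250 in total.
item = {"quantity": Decimal("150"), "amount": Decimal("15"),
        "amount_total": Decimal("2250")}
print(check_line_item(item))
```

Since these are rules of thumb rather than guarantees, a violation is better treated as a hint to review the row than as a hard error.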