Data extraction in 15 minutes: Step-by-step guide

A project that pairs RPA with AP automation needs an invoice data capture process that keeps up with the robots: reliable, accurate, scalable, layout-independent, rapid to implement, and easy to integrate. Traditional OCR-based systems simply do not fit the bill.

Rossum’s cognitive platform captures invoice information, including invoice tables, without any template setup, and UiPath robots make integration easy. In this step-by-step guide, we provide the source code that you will need to automate invoice processing in 15 minutes.

Invoice extraction with UiPath webinar

Following the e-book, see how easy it is to set up an integration between Rossum and UiPath in this 30-minute webinar. It includes practical examples, such as using table OCR to integrate data from tables.

How table OCR works

Data is one of the most important currencies in the business world today. Without it, you will be unable to keep up with your competitors. With it, you will be able to automate business processes, achieve greater scalability, and have more information with which to train your employees. Data is king. However, the first problem we face is this: how do we capture the data? To be useful in systems or other processes, the data needs to be in a structured format.

The issue is that most business data is now stored in unstructured formats, such as PDF documents, tables, and images, that are difficult for machines to read. The overall goal of an OCR solution is to convert this unstructured information into usable data by digitally scanning the document and then extracting and capturing the data within it. This process can enable you to automate entire business processes, give you cloud access to documents, and boost productivity and team motivation.

Tables are one of the most common parts of a document that you will want to capture. Images or PDFs of tables that contain important data have, historically, been incredibly difficult for machines to understand. In an ideal world, an image of a table would be scanned into the system. The system would then identify the rows and columns, rebuild that grid, and fill each cell with the right data. This digital table would then be exported in a usable format like an Excel spreadsheet.
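To make the ideal workflow above concrete, here is a minimal sketch in Python. It assumes an OCR step has already produced words with pixel coordinates; the sample words, the tolerance values, and the function names are all illustrative assumptions, not part of any particular product. It groups words into rows and columns by position and writes the result as CSV, which Excel can open.

```python
import csv
import io

# Hypothetical OCR output: (text, x, y) for the top-left corner of each word.
WORDS = [
    ("Item", 10, 12), ("Qty", 120, 10), ("Price", 200, 11),
    ("Widget", 10, 41), ("2", 121, 40), ("9.50", 199, 42),
    ("Gadget", 11, 70), ("1", 120, 71), ("24.00", 201, 70),
]

def cluster(values, tolerance):
    """Collapse nearby coordinates into one representative per group."""
    centers = []
    for value in sorted(values):
        if not centers or value - centers[-1] > tolerance:
            centers.append(value)
    return centers

def nearest_index(centers, value):
    """Index of the cluster representative closest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))

def build_table(words, row_tol=10, col_tol=30):
    """Place each word into the grid cell given by its row and column cluster."""
    row_centers = cluster([y for _, _, y in words], row_tol)
    col_centers = cluster([x for _, x, _ in words], col_tol)
    table = [["" for _ in col_centers] for _ in row_centers]
    for text, x, y in words:
        r = nearest_index(row_centers, y)
        c = nearest_index(col_centers, x)
        table[r][c] = (table[r][c] + " " + text).strip()
    return table

def to_csv(table):
    """Serialize the grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()

table = build_table(WORDS)
csv_text = to_csv(table)
print(csv_text)
```

Fixed tolerances like these only work on clean, regular layouts. Skewed scans, merged cells, and wrapped text break them quickly, which is part of why real-world table capture is so much harder than the sketch suggests.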

This is the ideal, but it very rarely happens, which leaves you wondering how to convert images to text in Excel. Even expensive OCR solutions that promise table data capture features will often do a very poor job. This means that your employees will have to regularly come in and correct all the errors that the machine made. Table OCR is something that the team at Rossum has focused on steadily.

We recognized that there was a problem with existing image-to-Excel OCR tools and built a system that may now just be the best image-to-Excel converter anywhere. Another great aspect of Rossum is that our table OCR runs within our easy-to-use validation interface, which allows you to refine the data that the OCR has captured in just a few clicks. After that, the table can be exported to the format of your choice.

Extract table from image

The goal sounds simple: “extract table from image.” Yet experts have spent years trying to build a solution that can flawlessly capture data from an image of a table and export it to a structured tabular format like Excel. Human employees do not share the difficulties that machines have with tasks like “convert image to Excel table,” which is why data entry, including jobs like “extract data from image to Excel,” has been handled by people for so many years. For a long time, this was the most dependable form of data capture. One of the primary challenges of using OCR for tables is that OCR systems struggle to detect that a table is there at all. Such systems rely on patterns to know what they are looking at.

The problem is that there is a huge amount of variability in the way tables are organized in images and PDF files, making it difficult for machines to learn the patterns. 

To solve this problem, in 2019, we released a system that can effectively capture data from images of tables easily and with nearly 100% accuracy. 

Our solution leans heavily on machine learning and artificial intelligence to achieve this. Powered by neural networks that mimic how our brains process information, Rossum has a uniquely powerful ability to capture data from all kinds of unstructured formats. 

One example within the document management space where this technology can be very useful is invoice processing in the accounts payable department. Often, there are line items and tables on invoice documents. If you have a system that can recognize these tables within the invoices, you can capture all the data within them. This kind of automatic data capture opens up all kinds of opportunities to automate your business processes and creates new ways to grow your business. 
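As one small illustration of what captured line items make possible, the sketch below checks that each row's amount matches quantity times unit price and that the rows add up to the invoice total. The field names and sample values are hypothetical and do not reflect any specific export format.

```python
from decimal import Decimal

# Hypothetical captured line items; field names are illustrative only.
LINE_ITEMS = [
    {"description": "Widget", "quantity": "2", "unit_price": "9.50", "amount": "19.00"},
    {"description": "Gadget", "quantity": "1", "unit_price": "24.00", "amount": "24.00"},
    {"description": "Shipping", "quantity": "1", "unit_price": "5.00", "amount": "6.00"},
]

def check_line_items(items, invoice_total):
    """Return a list of human-readable problems found in the captured rows."""
    problems = []
    running_total = Decimal("0")
    for index, item in enumerate(items):
        # Decimal avoids the rounding surprises of binary floats with money.
        expected = Decimal(item["quantity"]) * Decimal(item["unit_price"])
        amount = Decimal(item["amount"])
        if expected != amount:
            problems.append(f"row {index}: expected {expected}, captured {amount}")
        running_total += amount
    if running_total != Decimal(invoice_total):
        problems.append(f"total: rows sum to {running_total}, invoice says {invoice_total}")
    return problems

problems = check_line_items(LINE_ITEMS, "49.00")
for problem in problems:
    print(problem)
```

A check like this can route only the suspicious invoices to a person for review, while the rest flow straight through.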


A well-built table OCR API should be able to scan documents and images and extract the table data from them accurately. Rossum’s OCR API is specifically designed to be able to handle tables from any channel with accuracy. This is vital if you want to be able to use the data within those tables for automation purposes. 

Cognitive OCR platforms use AI to “read” documents like humans do, but they can operate much more quickly and efficiently. A good OCR API can handle large volumes of documents and still ensure that all data is received, captured, and exported correctly. Using an OCR solution as part of an overall automation strategy can free up time for your employees and teams so that they can focus on tasks that grow your business.

Automation also brings scalability. In the past, teams like accounts payable could end up holding back the growth of the company. These teams would be quickly overwhelmed by the uptick in the volume of orders and invoices to manage. This could then lead to some team members leaving or mistakes being made that could lead to further delays. With an automated system powered by a table OCR API, you can ensure that fluctuations in the order volume do not disrupt ordinary operations, and growth is that much easier to accomplish. 

OCR table extraction

Over the years, several different methods have arisen to achieve accurate OCR table extraction. Python is one of the most popular and easy-to-learn programming languages, and several Python programs have been written that extract tables from images and convert them to Excel.

Some programmers have written table detection programs in Python using OpenCV. OpenCV is an open-source computer vision library whose algorithms allow a program to detect objects, lines, and other structures in images. This is getting into the nuts and bolts of building an OCR program, but some scanning engines have been built with no more than 30 lines of code.
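To give a feel for how such a table detection script works, here is a simplified, pure-Python stand-in. Scripts built on OpenCV commonly isolate long horizontal and vertical ruling lines with morphological operations such as erosion and dilation. The sketch below does the same job on a tiny hand-made binary image by looking for long runs of dark pixels, so it runs without any libraries. The sample image, the threshold, and the function names are assumptions made for illustration.

```python
# Tiny binary "image": 1 = dark pixel, 0 = background. Hypothetical sample data.
IMAGE = [
    "111111111",
    "100010001",
    "100010001",
    "111111111",
    "100010001",
    "111111111",
]

def longest_run(bits):
    """Length of the longest unbroken run of dark pixels."""
    best = run = 0
    for bit in bits:
        run = run + 1 if bit == "1" else 0
        best = max(best, run)
    return best

def find_lines(image, min_fraction=0.9):
    """Return the row and column indices that look like ruling lines."""
    height, width = len(image), len(image[0])
    rows = [y for y, row in enumerate(image)
            if longest_run(row) >= min_fraction * width]
    columns = ["".join(row[x] for row in image) for x in range(width)]
    cols = [x for x, col in enumerate(columns)
            if longest_run(col) >= min_fraction * height]
    return rows, cols

def cells_from_lines(rows, cols):
    """Return (top, left, bottom, right) boxes for the space between lines."""
    return [
        (top + 1, left + 1, bottom - 1, right - 1)
        for top, bottom in zip(rows, rows[1:])
        for left, right in zip(cols, cols[1:])
        if bottom - top > 1 and right - left > 1
    ]

h_lines, v_lines = find_lines(IMAGE)
cells = cells_from_lines(h_lines, v_lines)
print(h_lines, v_lines, len(cells))
```

Each cell box could then be cropped and passed to an OCR engine. Note that this approach depends on visible ruling lines, so tables drawn without borders need a different strategy.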

This may sound deceptively simple, but the technology has been years in the making and is still not perfect. Python programs built on Tesseract are another category of rough-and-ready table extraction options; they use the Tesseract OCR engine to recognize the text. The limitation of these scripts, and of “extract table from image online” tools, is that they are not complete solutions. They lack the security features and full functionality required in a business setting.

Nonetheless, you can learn a great deal about OCR table extraction by reviewing some of the technical code that makes it possible. 

Table structure recognition deep learning

At the heart of innovations like Tesseract-based table recognition programs in Python is deep learning. Any online table OCR solution requires artificial intelligence. Deep learning is a subfield of machine learning, which in turn falls under the broader umbrella of artificial intelligence.

Basically, a Python program that successfully extracts tables from scanned PDFs needs a system with deep learning. Simply put, deep learning refers to teaching computers to recognize patterns from examples. Neural networks are the technology that powers deep learning and makes it possible.

Neural networks are designed to help a computing system learn and store information in a way loosely modeled on the human brain. In other words, neural networks allow a computer to “learn” things. Why does deep learning matter when we want to extract tables from scanned PDFs?

The roadblock that comes up when scanning images for tables of data is the variability in how tables are displayed. It can be very difficult for a computer to recognize patterns when there are so many different possibilities. This task requires a higher level of abstract reasoning than is usually required from artificial intelligence systems. This is why deep learning for table structure recognition is so important to this discussion.
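To make “learning a pattern” concrete, here is a deliberately tiny sketch: a single artificial neuron (a perceptron), the basic building block that deep networks stack in many layers. It is not a deep learning model and not how any production system works. It learns from labeled examples to tell table rows from ordinary text lines using two made-up features, namely the share of numeric tokens in a line and the number of wide gaps between words. The features and training data are hypothetical.

```python
# Hypothetical training data: ((numeric_share, wide_gaps), label)
# label 1 = line belongs to a table, 0 = ordinary prose.
SAMPLES = [
    ((0.6, 3), 1), ((0.5, 2), 1), ((0.7, 4), 1), ((0.4, 2), 1),
    ((0.0, 0), 0), ((0.1, 0), 0), ((0.05, 1), 0), ((0.0, 1), 0),
]

def predict(weights, bias, features):
    """Fire (return 1) when the weighted sum of the features is positive."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def train(samples, epochs=1000, lr=0.1):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error:
                mistakes += 1
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias

weights, bias = train(SAMPLES)
predictions = [predict(weights, bias, features) for features, _ in SAMPLES]
print(predictions)
```

Nobody tells the neuron which feature matters; it adjusts its weights from the examples alone. Real table layouts vary far too much for two hand-picked features, which is why deep networks that learn their own features from raw pixels are needed.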

OCR to table

When it comes to “free” things, you sometimes get what you pay for. Although you may be able to find a free table OCR solution, you may not be getting the quality you need. Most of these programs are highly error-prone.

Rossum provides a high-quality OCR to table conversion as part of our overall intelligent document processing (IDP) solution. If you are looking to truly unlock your business data and create new automation opportunities, you need to go beyond just OCR data capture and utilize a comprehensive IDP solution like Rossum. 

We make it simple to convert large numbers of documents into more useful forms of data. With Rossum, the OCR table to Excel process is simple and straightforward, leaving you more time to focus on growing your business.

Automate your table data extraction

Extracting data from tables is no easy task, especially when it comes to complex line items with nested values that are multiple pages long. No human should be stuck spending their days entering the data manually. Make the change today and move to an intelligent automated approach that will free up both your time and resources.