Peeking into the Neural Network Black Box
If you have ever come into contact with neural networks, you are probably familiar with the black box problem [1, 2]. Compared to many other algorithms from the glass box category, neural networks are inherently difficult to dissect. This should come as no surprise. We use neural networks to find solutions to problems which are difficult for humans to put into the language of algorithms. Whenever it is difficult for an expert to find features which would help any other Machine Learning algorithm, that’s typically where neural networks come into the picture and blow the competition out of the water.
So what can be done? As it turns out, the inherent difficulty of explaining how neural networks work does not deter everyone, and there are people who, I would say, are even drawn to the difficulty. I still think the proposed methods are far from explaining how a neural network behaves across an entire dataset, across all classes and so on. Right now, in my opinion, the explanation methods are most powerful in a sample-by-sample examination. They are, for example, able to tell you which parts of the image play the most important role in an image classification task. I would still argue that there are important conclusions to be drawn about the task as a whole.
Even though the methods are very general, in the end, I will of course focus on models we use at Rossum for invoices. If you last until the end of this blog post, I promise you will see some nice results regarding invoices and tips on how to use those in your own projects.
Summing Up Commonly Used Methods for Explaining Classifiers
The most important part of any method for explaining classifiers, stressed by some authors and skimmed over by others, is finding a reference input. The idea always revolves around finding the parts of an input, say an image, that are responsible for the classification. Mostly we do so by slowly changing a reference input into a given sample while carefully watching how the output of our network changes. If we notice that some parts of the input sample play an important role, we take note of them. There are many methods for causing the change, for observing it and for explaining it. In the most beautiful definition of the reference point I have seen, the authors define it as an input with maximal entropy on the output of the neural network. In other words, we should choose the reference input, or the starting point, so that our neural network is as confused and uncertain as possible, and then start changing it into the sample that we would like to explain.
Very often this reference point is taken to be a black image or an empty string. But that does not always make sense – why should a neural network trained on some dataset classify a black image equally into all categories? From my experiments with several explanation methods, it is this reference point that actually explains the network the most – this is what we should be looking for instead.
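To make the maximal-entropy idea a bit more concrete, here is a minimal sketch of how one might compare a few candidate reference inputs by the entropy of the model's output. Everything in it is an assumption made for illustration: toy_model is a hypothetical stand-in for a trained classifier, the three flat "images" are made up, and most_confusing_reference is just a helper name I chose, not part of any library.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_confusing_reference(model, candidates):
    """Return the candidate input whose model output has the highest entropy."""
    return max(candidates, key=lambda x: entropy(model(x)))


def toy_model(image):
    """Hypothetical 3-class classifier (assumption): a softmax over logits
    that depend only on the mean brightness of a flat list of pixels."""
    brightness = sum(image) / len(image)
    logits = [4.0 * (brightness - 0.5), 0.0, -4.0 * (brightness - 0.5)]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three made-up candidate reference inputs (0.0 = black, 1.0 = white).
black = [0.0] * 16
gray = [0.5] * 16
white = [1.0] * 16

reference = most_confusing_reference(toy_model, [black, gray, white])
```

For this toy model the black image is not the most confusing input; the gray one is, because it is the only candidate that yields a uniform output. A real network would need a proper search rather than a pick from three hand-made candidates, but the selection criterion stays the same.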
Classifying a Slice
In general, most of the methods assume that they are working with classifiers. For experiments with invoices, let’s assume that we have a model that can take an image, slice it and classify the slice into categories. For example, we could have a model with three categories (good, bad and neutral) and ask it to classify fighters from Star Wars.
Depending on the rectangle we draw, we can imagine that the resulting class is going to change. We would hope that each of the bounding boxes below gets classified into its corresponding class.
Explaining the Neural Networks for Reading Invoices
We use quite a lot of neural networks at Rossum, and one of them can read invoices. I took a model that can look at an invoice, slice it and return a vector of numbers between 0 and 1 that we can interpret as probabilities of the slice belonging to each category (amount_total,…). So what I attempted to do was to take a slice of the invoice that our model thinks is, for example, amount_total, and try to find areas of the invoice responsible for that prediction.
As I mentioned earlier, the task for me, more or less, became finding the reference input. It turned out to be at the core of the problem. I asked the model: which parts of the invoice do I need to modify in order to confuse you as much as possible?
So I took different slices of several invoices and, one by one, replaced them with white color, essentially erasing parts of the document. Using the aforementioned methods, I noted how much each change influenced the output of our model. Once I knew how each change influenced the output, I was able to estimate which changes I would need to make to the invoice to confuse the model as much as possible on the task of classifying the given slice.
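The erasing experiment can be sketched roughly as follows. This is a simplified occlusion-style illustration under stated assumptions, not the exact procedure or model used at Rossum: toy_model is a hypothetical two-class classifier that becomes confident when there is ink to the left of the target slice, the page is a tiny made-up grid, boxes are (top, left, bottom, right) tuples, and influence is measured as the increase in output entropy after a slice is painted white.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def erase(image, box, fill=1.0):
    """Return a copy of the image (list of rows) with the box painted white."""
    top, left, bottom, right = box
    erased = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            erased[r][c] = fill
    return erased


def influence_of_erasing(model, image, target_box, candidate_boxes):
    """For each candidate box, measure how much erasing it raises the
    entropy of the model's output for the target slice."""
    baseline = entropy(model(image, target_box))
    return {
        box: entropy(model(erase(image, box), target_box)) - baseline
        for box in candidate_boxes
    }


def toy_model(image, target_box):
    """Hypothetical classifier (assumption): the more ink there is to the
    left of the target slice, the more confident the prediction."""
    top, left, bottom, right = target_box
    pixels = [image[r][c] for r in range(top, bottom) for c in range(left)]
    ink = sum(1.0 - p for p in pixels) / len(pixels) if pixels else 0.0
    p = 1.0 / (1.0 + math.exp(-4.0 * ink))
    return [p, 1.0 - p]


W, K = 1.0, 0.0  # white paper, black ink
page = [
    [W, W, W, W, W, W],
    [K, K, K, W, K, K],  # a label on the left, the target slice on the right
    [W, W, W, W, W, W],
    [K, K, K, W, W, W],  # unrelated text elsewhere on the page
]
target = (1, 4, 2, 6)
label_box = (1, 0, 2, 3)
other_box = (3, 0, 4, 3)

scores = influence_of_erasing(toy_model, page, target, [label_box, other_box])
```

In this toy setup, erasing the label next to the target pushes the output to a 50/50 split, so label_box gets a large positive score, while erasing the unrelated text leaves the output unchanged and scores zero.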
I color-coded the influence estimates with a blue-red colormap and plotted them onto the invoices. So the slices that have a very low influence are blue and the slices that have a great influence are red. I also plotted the slice which is meant to be classified with a green color. Below are examples of different classes and the results of the attribution method.
Here I took an invoice and a slice which the network thought was an invoice_id, and plotted all other reasonable slices together with how much each of them contributes to keeping the model's output far from maximum entropy.
This time I took an invoice and a slice that gets classified as amount_total.
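The blue-red color coding used in these plots can be approximated with a simple linear interpolation. The sketch below is only an assumption about how such a mapping might look: the function name influence_to_rgb and the example scores are made up, and a real plot would more likely rely on a plotting library's built-in colormap.

```python
def influence_to_rgb(scores):
    """Map each influence score to an (R, G, B) tuple:
    the lowest score becomes pure blue, the highest pure red."""
    lowest, highest = min(scores.values()), max(scores.values())
    span = highest - lowest
    colors = {}
    for name, score in scores.items():
        # Normalize to [0, 1]; if all scores are equal, treat them as low.
        t = (score - lowest) / span if span > 0 else 0.0
        colors[name] = (round(255 * t), 0, round(255 * (1 - t)))
    return colors


# Made-up influence scores for three slices of a document.
example_scores = {"slice_a": 0.0, "slice_b": 0.25, "slice_c": 1.0}
colors = influence_to_rgb(example_scores)
```

Normalizing per document keeps the most influential slice on each invoice clearly red, at the cost of making colors incomparable across invoices.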
What is the Benefit?
The images above, I still feel, are quite amazing. It turns out that our model seems to have learned to search for what I would expect a human to search for. But of course, not all the runs of the attribution method resulted in such nice and simple explanations. Sometimes the output gave us insight into our dataset and hinted at the direction we need to take to improve our model. For example, consider the image below:
It seems that on this invoice, the slices that need to be erased in order to confuse the model, which previously categorized the input slice as the issue date, are the invoice ID slices. So it seems that our model has learned that the issue date is usually close to the invoice ID. With this information, we knew how to improve the dataset so that our models avoid making that mistake: simply include more invoices in which the spatial relation between issue date and invoice ID is different.
The method, as it turned out, also gave interesting results when we used it on our classification models. These models have the task of, for example, classifying the currency of the invoice. When I asked the currency model which part of the invoice I needed to erase in order to confuse it, many times the result was a red bounding box around the symbol of the currency (€, £) on the invoice. A couple of times, it turned out that having “Berlin” or “United Kingdom” on the invoice influenced the prediction even more.
We were able to find several more cases where this method gave us insights into our models and datasets. Sometimes we discovered properties of our models which are not easy to classify directly as good or bad, but are important to know nevertheless, such as the influence of geography-related words on the currency classifier. We also learned that if a country is present on the invoice, language-related slices tend to have more influence on the classification. For example, Germany being on the invoice meant that slices with Umsatzsteuer (USt) had more influence on something near them being classified as VAT than when Germany was not on the invoice.
[1] “iNNvestigate neural networks!” (https://arxiv.org/abs/1808.04260) by Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam Hägele, Kristof T. Schütt, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller, Sven Dähne and Pieter-Jan Kindermans
[2] “Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges” (https://arxiv.org/abs/1803.07517) by Gabrielle Ras, Marcel van Gerven and Pim Haselager
[3] “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation” (PLoS ONE 10(7): e0130140) by Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller and Wojciech Samek