Our researcher, Tonda Hoskovec, has long been thinking about how neural network training behaves on non-convex tasks with many local minima. Such minima make training difficult or inconsistent for many machine learning problems. The theory also lags behind practice, and few guarantees exist. A recent theoretical paper aims to solve this problem in a new, ingenious way, and it caught Tonda’s attention. He decided to run the first experimental test of this theory. Is it practical, and does it work? Read on to learn the outcome!

Recently the article *Adding One Neuron Can Eliminate All Bad Local Minima* really caught my attention. It deals with problems that have been on my radar for some time (and on the radars of a lot of people [see refs. 9-31 in the paper]). It is very well written and understandable. And while I enjoyed reading it, there were no experiments in the paper! So I thought we could fix that and play around with it a bit, mainly for fun, because as we will see later, there are good reasons for excluding experiments from the paper. The obvious one is that it is generally agreed that all the local minima of the empirical loss function on the training set tend to perform similarly on the validation set, and that finding a global minimum can even lead to overfitting [Choromanska et al.]. Here we will first look in simple words at what the paper claims, then I will present some code snippets from my implementation of the core idea, and finally we will look at the experimental results and perhaps even suggest a novel method for early stopping.

## Main claims the paper makes

The main result of the paper is that by adding (in the literal sense of the mathematical add operation) the output of a single exponential neuron to the output of a neural network, we can make the network converge to a global minimum of its original loss function.

If this held for an arbitrary neural network and task, it would be amazing. However, the paper only claims these results for binary classification and a specific, albeit very general, class of loss functions. The result came as a surprise to me because a single neuron on its own does not tend to behave this way [Auer et al.].

To be specific, the paper works with a binary classifier: an arbitrary neural network with a single output, a number between -1 and 1, where one class corresponds to positive numbers and the other to negative numbers. The targets \(\{y_i\}\) of our dataset are just the numbers -1 or 1. An example of a loss function for which all the assumptions of the paper hold is the polynomial Hinge loss \[

l(z) = [\max\{z + 1, 0\}]^p, \, p \geq 3.

\] The exponential neuron is parametrized by its weights: a real number \(a\) that we will call the "scale", a real vector \(\mathbf{w}\) of the same length as the input \(\mathbf{x}\), and a bias \(b\) (a real number). Its activation is \[

f_e(\mathbf{x}) = a \exp(\mathbf{w}^T \mathbf{x} + b).

\]
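To make the two definitions concrete, here is a minimal sketch in plain Python. It is only an illustration, not the Keras code used in the experiment, and the function names `poly_hinge` and `exp_neuron` are made up for this example.

```python
import math


def poly_hinge(z, p=3):
    """Polynomial Hinge loss l(z) = max(z + 1, 0) ** p, with p >= 3."""
    return max(z + 1.0, 0.0) ** p


def exp_neuron(x, w, a, b):
    """Exponential neuron f_e(x) = a * exp(w . x + b)."""
    return a * math.exp(sum(wi * xi for wi, xi in zip(w, x)) + b)


# A confident, correct prediction (y = 1, f = 1) gives z = -y * f = -1 and zero loss.
print(poly_hinge(-1.0))  # 0.0
# An undecided prediction (f = 0) gives z = 0 and a loss of 1.
print(poly_hinge(0.0))   # 1.0
# With scale a = 0 the neuron is inactive, whatever w and b are.
print(exp_neuron([0.5, -2.0], [1.0, 1.0], a=0.0, b=3.0))  # 0.0
```

The loss is evaluated at \(z = -y f(\mathbf{x})\), so a large positive margin \(y f(\mathbf{x})\) drives it to zero.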

Given these we can now describe the main result more formally as a *cookbook*:

- Take a dataset for binary classification and a classifier, i.e. a neural network.
- Take the structure of the neural network and add the output of a single exponential neuron to its output.
- Add an L2 regularizer on the scale weight to the loss function.
- Do standard training and you will arrive at a point where the scale is zero (the exponential neuron is inactive), which is a global minimum of the original classifier trained on your dataset.

As the cookbook mentions, it is necessary to modify the original empirical loss function \[

L_n = \sum_i l(-y_i f_0(\mathbf{x}_i)),

\] where we sum over the whole dataset and \(f_0(\mathbf{x}_i)\) is the activation of the original neural network. We need to add an L2 regularizer for the scale and modify the activation function to include the additional neuron, so that our empirical loss function becomes \[

L_n = \sum_i l(-y_i f(\mathbf{x}_i)) + \frac{\lambda a^2}{2},

\] where \(f(\mathbf{x}_i) = f_0(\mathbf{x}_i) + f_e(\mathbf{x}_i)\) is the modified activation. Notice how the regularization pushes the activation of the exponential neuron towards zero.
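The modified loss can be written out directly. The sketch below is plain Python under stated assumptions: `f0` is a toy stand-in for the original network, the two data points are invented, and the default `lam` is arbitrary. None of these come from the paper or from the experiment.

```python
import math


def poly_hinge(z, p=3):
    """Polynomial Hinge loss l(z) = max(z + 1, 0) ** p."""
    return max(z + 1.0, 0.0) ** p


def modified_loss(data, f0, a, w, b, lam=1.0, p=3):
    """L_n = sum_i l(-y_i * (f0(x_i) + a * exp(w . x_i + b))) + lam * a**2 / 2."""
    total = 0.0
    for x, y in data:
        f_e = a * math.exp(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += poly_hinge(-y * (f0(x) + f_e), p)
    return total + lam * a ** 2 / 2.0


# Toy stand-in for the original network f_0 (hypothetical).
f0 = lambda x: math.tanh(x[0] - x[1])
# Invented data points (x_i, y_i) with targets in {-1, 1}.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]

# With a = 0 the modified loss reduces to the original loss.
original = sum(poly_hinge(-y * f0(x)) for x, y in data)
print(abs(modified_loss(data, f0, a=0.0, w=[0.3, -0.2], b=0.1) - original) < 1e-12)  # True
```

Setting `a=0.0` switches off both the neuron and the regularizer, which is why the two losses agree at that point.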

Now, from a practical standpoint, we never train on the entire dataset, but rather on its *train* split, and we evaluate on the *validation* split. But we expect the points in both to come from the same distribution, so we don't mind.

### Intuition for the global minimum

The essential property of the exponential neuron that is responsible for this nice result is that the exponential function is positive for all arguments, together with the fact that its derivative with respect to the bias is exactly the same activation function. To see this, let's write out the modified empirical loss and its derivatives with respect to the scale and the bias, respectively: \[

L_n = \sum_i l \left( -y_i \left( f_0 (\mathbf{x}_i) + a \exp(\mathbf{w}^T \mathbf{x}_i + b)\right) \right) + \frac{\lambda a^2}{2},

\] \[

\frac{\partial L_n}{\partial a} = \sum_i l^\prime (\ldots) (- y_i) \exp(\mathbf{w}^T \mathbf{x}_i + b) + \lambda a,

\] \[

\frac{\partial L_n}{\partial b} = \sum_i l^\prime (\ldots) (- y_i a) \exp(\mathbf{w}^T \mathbf{x}_i + b).

\]

Now two of the necessary conditions for \(a\) and \(b\) to be part of the weights for which the empirical loss function is at its minimum are simply \(\frac{\partial L_n}{\partial a} = 0\) and \(\frac{\partial L_n}{\partial b} = 0\). Let us write those out, but multiply the first one by \(a\): \[

a \sum_i l^\prime (\ldots) (- y_i) \exp(\mathbf{w}^T \mathbf{x}_i + b) + \lambda a^2 = 0,

\] \[

\sum_i l^\prime (\ldots) (- y_i a) \exp(\mathbf{w}^T \mathbf{x}_i + b) = 0.

\] The first result that we can get from these is that the additional neuron is inactive in any local minimum of the modified loss function. To see this, note that in any local minimum these two equations have to hold simultaneously, so we can subtract the second from the first and get \[

\lambda a^2 = 0.

\] Because \(\lambda > 0\) by the assumption of the L2 regularization, \(a\) has to be zero.
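The identity behind this step, \(a \, \frac{\partial L_n}{\partial a} - \frac{\partial L_n}{\partial b} = \lambda a^2\), holds at every point and not only at minima, so it can be checked numerically. The sketch below uses central finite differences in plain Python. The toy network `f0`, the data points and all parameter values are invented for illustration.

```python
import math


def loss(a, b, w, data, f0, lam, p=3):
    """Modified empirical loss with the polynomial Hinge loss and L2 on the scale."""
    total = 0.0
    for x, y in data:
        f = f0(x) + a * math.exp(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += max(-y * f + 1.0, 0.0) ** p
    return total + lam * a * a / 2.0


def grad_a_b(a, b, w, data, f0, lam, h=1e-6):
    """Central finite-difference estimates of dL/da and dL/db."""
    da = (loss(a + h, b, w, data, f0, lam) - loss(a - h, b, w, data, f0, lam)) / (2 * h)
    db = (loss(a, b + h, w, data, f0, lam) - loss(a, b - h, w, data, f0, lam)) / (2 * h)
    return da, db


# Hypothetical stand-ins: a toy "network" and three made-up data points.
f0 = lambda x: math.tanh(0.7 * x[0] - 0.4 * x[1])
data = [([1.0, 0.5], 1), ([-0.3, 1.2], -1), ([0.8, -0.6], -1)]
a, b, w, lam = 0.4, -0.2, [0.3, -0.1], 0.5

da, db = grad_a_b(a, b, w, data, f0, lam)
# a * dL/da - dL/db should equal lam * a**2 up to finite-difference error.
print(abs(a * da - db - lam * a * a) < 1e-6)  # True
```

At a critical point both derivatives vanish, so the left-hand side is zero and the scale has to be zero as well.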

Now for any local minimum we can drop everything that is multiplied by the scale weight! Obviously that means that the neuron is inactive. We can also see the reason for the regularization here: the whole thing would never work without it.

But even more importantly, the condition on the derivative with respect to \(a\) still has to hold in any local minimum, even if \(a\) is zero: \[

\sum_i l^\prime ( -y_i f_0 (\mathbf{x}_i) ) (- y_i) \exp(\mathbf{w}^T \mathbf{x}_i + b) = 0.

\] From this equation alone, it cannot be seen that every local minimum is a global minimum of the empirical loss function. But the essential argument is that for any datapoint \(\mathbf{x}_i\) the exponential is a positive number, and from this one can draw further conclusions about the rest of the sum, which together with the assumptions about the loss function lead to the desired result.

The complete proof that there is only a global minimum of the empirical loss function and no bad local minima can be found in the paper. It is quite formal, and to understand it you have to be familiar with tensors, Lagrange interpolating polynomials and Taylor expansions. It is not difficult if you remember these things from any course on mathematical analysis, but it has to be put together with some care.

## Experiment

So, what about the experiment? I did a little searching and found a suitable published classifier, a neural network, on GitHub, which is maintained by the neurology labs at UNL and UCD Anschutz. It comes with a dataset of EEG time series, which are labeled into two classes. The data points come from real people who were recorded picturing either an activity they know or an unfamiliar one. You can read more about the dataset in their repository and in the papers they published. In essence, I had the input vectors, the corresponding targets and the neural network to modify, and I could check whether it can be improved with the extra neuron.

I implemented the whole thing in Python and Keras, and I want to alert you to some issues I ran into when doing this. The first one is that most of the Keras losses require the targets to be zeros and ones for binary classification, while we have them as minus ones and ones. We also need the polynomial Hinge loss, which is currently unavailable in Keras for \(p \geq 3\). And finally, I needed the exponential neuron. I will share here the snippets from my code that you will need if you try to run one of these yourself.

So, the first one is the exponential neuron Keras layer. Notice the L2 regularization of the scale weight. The value 0.1 is quite large, but it is the one I found to work the best:

    from keras import backend as K
    from keras.layers import Layer, Input, Add
    from keras.regularizers import l2


    class EXP(Layer):
        """Single exponential neuron: scale * exp(x . kernel + bias)."""

        def __init__(self, **kwargs):
            super(EXP, self).__init__(**kwargs)

        def build(self, input_shape):
            # trainable scale weight, L2-regularized so that it is pushed towards zero
            self.scale = self.add_weight(name='scale',
                                         shape=(1,),
                                         initializer='uniform',
                                         trainable=True,
                                         regularizer=l2(0.1))
            # kernel weights, one per input feature
            nparams = input_shape[1]
            self.kernel = self.add_weight(name='kernel',
                                          shape=(nparams, 1),
                                          initializer='uniform',
                                          trainable=True)
            # bias weight
            self.bias = self.add_weight(name='bias',
                                        shape=(1,),
                                        initializer='uniform',
                                        trainable=True)
            super(EXP, self).build(input_shape)

        # activation:
        def call(self, x):
            prod = K.dot(x, self.kernel)
            exp = K.exp(prod + self.bias)
            result = self.scale * exp
            return result

        def compute_output_shape(self, input_shape):
            return input_shape[0], 1

Then I needed a modified Keras accuracy and the polynomial Hinge loss, both of which work with targets of -1 and 1:

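    from keras import backend as K


    def p_hinge(y_true, y_pred):
        # polynomial Hinge loss with p = 3 for targets in {-1, 1}
        argument = - y_true * y_pred + 1.
        maximalized = K.maximum(argument, 0.)
        return K.pow(maximalized, 3)


    def h_accuracy(y_true, y_pred):
        # a prediction counts as correct when its sign matches the sign of the target
        return K.mean(K.equal(K.sign(y_true), K.sign(y_pred)), axis=-1)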

In order to have comparable results (with the exponential neuron and without it), I took the original structure and only changed the activation of the last layer from sigmoid to tanh and used the Adam optimizer instead of the original RMSprop. So the unmodified network for my experiment was trained with:

    import keras
    from keras.models import Sequential
    from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

    # x_train, y_train, x_test, y_test and input_shape come from the original notebook
    batch_size = 128
    epochs = 400

    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape))
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(10))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('tanh'))

    # initiate Adam optimizer
    opt = keras.optimizers.Adam(0.0002)

    # Compile the model
    model.compile(loss=p_hinge, optimizer=opt, metrics=[h_accuracy])

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')

    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_data=(x_test, y_test),
                        shuffle=True)

To obtain the dataset, run the code in the original notebook and add somewhere:

    y_train[y_train == 0] = -1.
    y_test[y_test == 0] = -1.

And this is how I modified the structure with the extra exponential neuron:

    import keras
    from keras.callbacks import Callback
    from keras.models import Sequential, Model
    from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense, Input, Add

    batch_size = 128
    epochs = 400


    class ScalesSaver(Callback):
        """Records the scale weight of the exponential neuron after every epoch."""

        def __init__(self):
            super(ScalesSaver, self).__init__()
            self.scales = []

        def on_epoch_end(self, epoch, logs=None):
            # Keras attaches the model being trained as self.model
            self.scales.append(self.model.get_layer("exp").get_weights()[0][0])


    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape))
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(10))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('tanh'))

    # add the output of the exponential neuron to the output of the original network
    feats = Input(shape=input_shape)
    orig = model(feats)
    flat = Flatten()(feats)
    extra = EXP(name="exp")(flat)
    combined = Add()([orig, extra])
    mod = Model(feats, combined)

    # initiate Adam optimizer
    opt = keras.optimizers.Adam(0.0002)

    # Compile the model
    mod.compile(loss=p_hinge, optimizer=opt, metrics=[h_accuracy])

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')

    history_scales = ScalesSaver()
    history_exp = mod.fit(x_train, y_train,
                          batch_size=batch_size,
                          epochs=epochs,
                          validation_data=(x_test, y_test),
                          shuffle=True,
                          callbacks=[history_scales])

Notice that I not only used the exponential neuron layer, but also kept track of the scale weight with a custom callback. I wanted to check whether the scale would approach zero as we fit the classifier. But let's not get ahead of ourselves and first see the results!

## Results

So without further ado, here are the validation accuracies of the original model and the modification from one training:

The red and blue colors correspond to the networks with and without the exponential neuron, respectively. And we can see that the highest achieved validation accuracy is higher for the network with the exponential neuron! This is not a cherry-picked example of training: the model with the extra neuron consistently achieved higher accuracy. In this run the difference was 0.831 vs. 0.828, which is not very large. But I ran the training five times and here are the 90% confidence intervals of the validation accuracies:

It is still not clear how much these confidence intervals overlap even though it seems that the red color is winning.

So I ran a Mann-Whitney test on the maximum accuracies achieved to see if the additional neuron actually helps the network. The result is a p-value of 0.006, which is quite low, and my feeling was now shifting towards the neuron helping more than not. The behaviour of achieving a higher accuracy was quite consistent over the runs I did. On average, the maximum accuracy achieved was 0.81207 for the unmodified network versus 0.82195 for the modified network, which is a difference of approximately 0.01 in accuracy.
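For readers who want to see what such a test computes, here is a minimal sketch of an exact two-sided Mann-Whitney test in plain Python. The accuracy values are placeholders invented for the example, not the numbers measured in the runs above, and the function names are made up as well.

```python
from itertools import combinations


def mann_whitney_u(xs, ys):
    """U statistic for xs: number of pairs (x, y) with x > y, ties counting one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def exact_two_sided_p(xs, ys):
    """Exact permutation p-value: share of relabelings whose U is at least as
    far from its null mean n * m / 2 as the observed U."""
    pooled = list(xs) + list(ys)
    n, m = len(xs), len(ys)
    mean_u = n * m / 2.0
    observed = abs(mann_whitney_u(xs, ys) - mean_u)
    hits = total = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(gx, gy) - mean_u) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder values for illustration only, NOT the accuracies from the experiment.
baseline = [0.80, 0.81, 0.82, 0.79, 0.83]
modified = [0.84, 0.85, 0.86, 0.87, 0.88]

print(mann_whitney_u(modified, baseline))     # 25.0
print(exact_two_sided_p(modified, baseline))  # 2 / 252, roughly 0.0079
```

With five runs per group there are only 252 possible relabelings, so full enumeration is cheap. For larger samples a library routine such as `scipy.stats.mannwhitneyu` is the more practical choice.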

Of course, this was a single dataset and a single neural architecture, so we cannot draw any general conclusions from this at all, and a lot more work would be required to get there. Five recorded runs are also very few, so I would not rely on the results even for this case :). I should mention that this is supposed to whet your appetite, not to give a rigorous discussion of the results. If you feel like it, go ahead and do more runs yourself! I should also mention that I ran other experiments as well, not so successfully. The most interesting one was adding the extra neuron to a GAN discriminator. But because GANs are so sensitive to their parameters, it was very difficult to draw any conclusions about the effect there.

I also became somewhat interested in the massive jumps in the learning curve after some of the epochs. At first I thought they were caused by the updates of the scale weight. My intuition was that perhaps it is very close to zero, so some updates might change its sign, which would influence the overall predictions quite a bit. As it turned out, this was closely related to another question I had.

A very natural question to ask is: could the scale weight be used to see how far we are from the global minimum? So I plotted its value after each epoch and shared the epoch axis with the plots of validation loss and accuracy to get:

And here we can see that although the scale increases at first, it then converges to zero and flattens near the end of the training. It would be hard to define when to stop the training based solely on that, since some of the epochs made the validation accuracy much worse even as the scale approached zero, but I can imagine an early stopper based on a combination of this and other things. The shape is very nice, smooth and stable. By accident, this plot also immediately showed that my intuition for the peaks was completely wrong: the change in this parameter was very smooth and not once did it change sign. So the cause of the peaks probably lies in the dataset, batch size etc., since they also appear in the training of the unmodified network.
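One way such a combined early stopper could look is sketched below in plain Python. This is purely hypothetical and was not part of the experiment: the rule, the function name `should_stop` and the thresholds `scale_tol` and `patience` are all assumptions chosen for illustration.

```python
def should_stop(scales, val_accs, scale_tol=1e-3, patience=10):
    """Hypothetical stopping rule: stop once |scale| has stayed below scale_tol
    for the last `patience` epochs AND the validation accuracy has not beaten
    its earlier best during those same epochs."""
    if len(scales) < patience or len(val_accs) < patience:
        return False
    scale_flat = all(abs(s) < scale_tol for s in scales[-patience:])
    best_before = max(val_accs[:-patience], default=float("-inf"))
    no_gain = max(val_accs[-patience:]) <= best_before
    return scale_flat and no_gain


# Made-up histories: the scale decays towards zero while the accuracy plateaus.
scales = [0.5, 0.2, 0.05, 0.0005, 0.0004, 0.0003]
val_accs = [0.70, 0.80, 0.83, 0.82, 0.81, 0.82]

print(should_stop(scales, val_accs, patience=3))      # True
print(should_stop(scales[:4], val_accs[:4], patience=3))  # False, scale not yet flat
```

The per-epoch scale values could come from a callback like `ScalesSaver`, and the accuracies from the training history. Requiring both conditions guards against stopping on one of the noisy epochs where the accuracy dips while the scale is already small.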

Overall I had a lot of fun playing with the extra neuron and I plan to experiment with it further. And I hope it is a nice addition to the original paper.

## Additional remarks

I just want to mention that the paper contains extensions, for example for the special case of fully connected neural networks, which I neglected in this blog post. The reasons for this decision were twofold: the post would have been way too long, and finding a fully connected network to experiment with was not as easy as it seemed. But I encourage everyone to go through the extensions as well, since there are some extremely interesting ideas for future research.
