Final post

I just wanted to add a graph of the test accuracy over the course of training for all the sub-models:

[Figure: training_full_sub, test accuracy of each sub-model over training]

  • Blue : 3×3
  • Green : 4×4
  • Red : 5×5
  • Teal : 6×6
  • Purple : 7×7
  • Yellow : Increasing
  • Black : Decreasing

So I guess the best filter size on 128×128 images is 3×3, constant across all layers.


Results for the overall model

Here are the training (blue, computed on a single mini-batch) and test (green) accuracies over 20 epochs:

[Figure: training_full_pretrain, training (blue) and test (green) accuracy over 20 epochs]

I’m really disappointed that I couldn’t get over 85% accuracy on the test set, but here are some observations about filter sizes from this experiment:

  • The individual accuracies of the sub-models are:
    •  43.75%
    • 37.5%
    • 46.875%
    • 50%
    • 50%
    • 53.125%
    • 50%
  • This seems to indicate two problems: first, the sub-models are unlearning, and second, this looks a lot like ensemble methods, where many poor classifiers can still manage to do a good job overall.
  • It would also seem to indicate that progressively bigger filters are better, but this result isn’t very credible given the poor performance of the sub-models.
  • We can see the effect of pretraining as the model quickly gains accuracy over random choice in the first epoch (over 20% accuracy gain). After this, I have the feeling that what was learnt during the pretraining is getting forgotten.

Training the whole model

What I found to work for training the whole model is the following:

  • Adding some batch normalization.
    • The idea is that batch normalization seems to let more gradient flow through the network, leading to faster training and better generalization, and it allows us to use a higher learning rate.
  • Pre-training the sub-models.
    • Using a loss that only accounts for the losses on the sub-models, I train for a few epochs, and afterwards I train the whole model using only the loss on the softmax classifier. It is hard to know when to stop, since during pre-training the total model accuracy always hovers around 50% (the last layer is not updating its weights), but I found that pre-training for about 5 epochs really improves the training of the full model. Not only is training much faster after the pre-training, it also seems to find a better minimum of the loss function. (A minimal sketch of this two-phase setup follows this list.)
  • Connecting the softmax layer to the second fully connected layer of each sub-model (instead of to its sigmoid classifier).
    • I think I was losing too much information about what my sub-models were learning when I only connected to their output layers. Also, I feel like what I was doing was too similar to bagging, a traditional ensemble method. To remedy this, I connected my softmax classifier to the last fully connected layer of each sub-model. I can still check the contribution of each sub-model to the final output, only now I need to sum over all the weights connecting to the output (which is still very easy to do).
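
Roughly, the two-phase setup looks like the sketch below (this is not the exact code from my repo: submodel_outputs and softmax_out are placeholder names for the list of sigmoid output layers and the final 2-unit softmax layer, and the use of ADAM here is only illustrative).

import theano
import theano.tensor as T
import lasagne

X = T.tensor4('X')
y = T.ivector('y')  # class labels in {0, 1}

sub_preds = [lasagne.layers.get_output(l, X) for l in submodel_outputs]
softmax_pred = lasagne.layers.get_output(softmax_out, X)

# Phase 1: pre-train on the sum of the sub-model (sigmoid) losses only.
pretrain_loss = sum(lasagne.objectives.binary_crossentropy(p.flatten(), y).mean()
                    for p in sub_preds)
sub_params = lasagne.layers.get_all_params(submodel_outputs, trainable=True)
pretrain_fn = theano.function(
    [X, y], pretrain_loss,
    updates=lasagne.updates.adam(pretrain_loss, sub_params))

# Phase 2: train the whole model on the softmax loss only.
full_loss = lasagne.objectives.categorical_crossentropy(softmax_pred, y).mean()
all_params = lasagne.layers.get_all_params(softmax_out, trainable=True)
train_fn = theano.function(
    [X, y], full_loss,
    updates=lasagne.updates.adam(full_loss, all_params))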

In the next post, I’ll show the training curves and my results.


Investigating filter sizes

Following my previous post, I decided I needed to go for a bigger model, but I also wanted to focus my work on the question of the « right » filter size. So I decided to create a model that combines many parallel models, each with different filter sizes (all else being equal), and see which one contributes the most to the total model.

Here is how each of the small models is built:

  • (Input : 3x128x128)
  • ConvLayer : 64 ?x?
  • Max Pooling : 2×2 windows with 2×2 strides
  • ConvLayer : 64 ?x?
  • Max Pooling : 2×2 windows with 2×2 strides
  • ConvLayer : 64 ?x?
  • Max Pooling : 2×2 windows with 2×2 strides
  • ConvLayer : 64 ?x?
  • Max Pooling : 2×2 windows with 2×2 strides
  • ConvLayer : 64 ?x?
  • Max Pooling : 2×2 windows with 2×2 strides
  • ConvLayer : 64 ?x?
  • Max Pooling : 2×2 windows with 2×2 strides
  • Fully Connected : 128 units
  • Fully Connected : 128 units
  • Sigmoid Output : 1 unit

All of the non-linearities are rectifiers (except for the outputs).

My whole model is seven of these small models in parallel (they share their input, and their outputs are all connected to a softmax layer with 2 units). The idea is that each small model will try to be as good as it can, and the final softmax layer will only be weighting the models, so the bigger the weight, the better the associated filter pattern. (A rough sketch of this construction follows the filter-size list below.)

The filter sizes are the following:

  • Sub-model 1 : 3×3 (constant over all layers)
  • Sub-model 2 : 4×4 (constant over all layers)
  • Sub-model 3 : 5×5 (constant over all layers)
  • Sub-model 4 : 6×6 (constant over all layers)
  • Sub-model 5 : 7×7 (constant over all layers)
  • Sub-model 6 : 2×2, 3×3, 4×4, 5×5, 6×6, 7×7 (each layer has a bigger filter)
  • Sub-model 7 : 7×7, 6×6, 5×5, 4×4, 3×3, 2×2 (each layer has a smaller filter)
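
In Lasagne, the construction could look roughly like this sketch (the names are mine, not the code from my repo, and I am assuming « same »-style padding of filter_size // 2 so that the bigger filters still fit after six rounds of pooling; the list above doesn’t specify any padding):

from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DenseLayer, ConcatLayer)
from lasagne.nonlinearities import rectify, sigmoid, softmax

def build_submodel(input_layer, filter_sizes):
    # Six conv/pool blocks, two dense layers, one sigmoid output unit.
    net = input_layer
    for fs in filter_sizes:
        net = Conv2DLayer(net, num_filters=64, filter_size=fs,
                          pad=fs // 2, nonlinearity=rectify)
        net = MaxPool2DLayer(net, pool_size=2, stride=2)
    net = DenseLayer(net, num_units=128, nonlinearity=rectify)
    net = DenseLayer(net, num_units=128, nonlinearity=rectify)
    return DenseLayer(net, num_units=1, nonlinearity=sigmoid)

configs = [(3,) * 6, (4,) * 6, (5,) * 6, (6,) * 6, (7,) * 6,
           (2, 3, 4, 5, 6, 7), (7, 6, 5, 4, 3, 2)]

input_layer = InputLayer((None, 3, 128, 128))
submodels = [build_submodel(input_layer, c) for c in configs]

# The seven sigmoid outputs share the input and feed one 2-unit softmax.
merged = ConcatLayer(submodels, axis=1)
overall_output = DenseLayer(merged, num_units=2, nonlinearity=softmax)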

The next post will focus on how I trained this model, but now let’s look at a few things I tried that didn’t work.

  • Adding dropout on top of the sub-models (right before the softmax).
    • The training never moved from its initial predictions, or at least it never got any better: the accuracy was stuck around 50%, and no matter how many epochs went by, no progress was ever made. I think the problem was that, with only seven units to choose from and the limited number of connections associated with them, dropout was effectively blocking all the gradient from my loss from reaching the lower layers. The idea behind putting a dropout layer there was to keep the overall model from randomly choosing one of the sub-models as its « main » model and then adjusting the others so that they correct the « main » model. To solve this problem without dropout, I decided to use a custom loss function for my network, which is the sum of the losses from the small models and the loss of the overall model (see the short sketch after this list). This way, since every sub-model is also trying to do the best it can, I think the overall model will have no choice but to pick its weights according to the quality of the sub-models.
  • Putting equal weights on the losses from the sub-models and on the overall model.
    • I guess there was too much « fighting » between the losses, so the model couldn’t find a direction to move in. What I saw was either the model learning to predict the same label all the time, or something equivalent to randomly picking the label (in both cases the accuracy was about 50%).
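
For reference, the custom loss is just a weighted sum of Theano scalars, something along these lines (sub_losses and overall_loss are placeholder names for the seven sub-model losses and the softmax loss, and the 0.1 weight is only illustrative, not the value I actually used):

sub_weight = 0.1  # illustrative; equal weighting (1.0) did not work for me
total_loss = overall_loss + sub_weight * sum(sub_losses)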

Data augmentation

I used Fuel’s functions to create a server which does the data preparation for my neural network. In this preparation, I do a few things that could be considered data augmentation: for example, for each image, I take a random crop, which I then rotate randomly. In the end, the server outputs batches of 128×128 images ready to be used to train my network.
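
As a rough sketch, the server side looks something like this (I am assuming Fuel’s built-in DogsVsCats dataset with its image_features source; the random rotation is left out here since it needs its own transformer, and the exact sizes and port are only illustrative):

from fuel.datasets.dogs_vs_cats import DogsVsCats
from fuel.schemes import ShuffledScheme
from fuel.streams import DataStream
from fuel.transformers.image import MinimumImageDimensions, RandomFixedSizeCrop
from fuel.server import start_server

train = DogsVsCats(('train',))
stream = DataStream(train, iteration_scheme=ShuffledScheme(train.num_examples, 48))
# Make sure every image is at least 140x140, then take a random 128x128 crop.
stream = MinimumImageDimensions(stream, (140, 140), which_sources=('image_features',))
stream = RandomFixedSizeCrop(stream, (128, 128), which_sources=('image_features',))
start_server(stream, port=5557)  # the training script connects with ServerDataStream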

I tried using this server to retrain my previous model, to see the impact of the data augmentation, and saw absolutely no gain in performance! My guess is that my architecture wasn’t deep/wide enough to have the capacity to make use of the extra information provided by the data augmentation.

I’ve since moved on to a slightly different model, since I wanted to investigate the impact of filter size:

  • input : 128x128x3
  • Conv2D : 64 5×5 filters
  • Max Pooling : 2×2 pooling with 2×2 strides
  • Conv2D : 32 3×3 filters
  • Max Pooling : 2×2 pooling with 2×2 strides
  • Conv2D : 32 3×3 filters
  • Max Pooling : 2×2 pooling with 2×2 strides
  • Conv2D : 32 3×3 filters
  • Max Pooling : 2×2 pooling with 2×2 strides
  • Fully Connected : 128 units
  • Fully Connected : 128 units
  • Softmax : 2 units

I also trained for a long time (over 25 epochs) and got the following results :

[Figure: training_3x3, training curves for this model]

and exactly 80% (0.80000004) on the validation set. My conclusion seems to be that, right now, what is preventing me from learning a better model is a lack of capacity and not the filter size. We also see that odd-sized filters don’t seem to perform better than even-sized ones.


About filter size

I recently got thinking about filter size. More specifically, I wondered whether we should use odd- or even-sized filters. On one hand, odd filter sizes seem more common and have an easier interpretation (when the weight of the central pixel is far from 0): something along the lines of « take the central pixel and blend in information from its close neighbors ». On the other hand, I’ve gotten much better results so far with even-sized filters. Also, it is pretty obvious that the filter size should depend on the input size, as a 3×3 filter on a 128×128 image won’t capture the same level of detail as a 3×3 filter on a 1024×1024 image. I don’t know of any rules for choosing filter sizes, but I will try to investigate further.

One thing I will implement (next week?) is multiple filter sizes in parallel, as in Google’s Inception-v3 model. This seems to be a best-of-both-worlds solution, and I hope it will improve my results.
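
A small Lasagne sketch of the idea (the filter sizes and counts here are just placeholders, not what Inception-v3 actually uses):

from lasagne.layers import Conv2DLayer, ConcatLayer

def mixed_filter_block(incoming, num_filters=32):
    # Apply several filter sizes to the same input and stack the feature maps;
    # 'same' padding keeps all branches at the same spatial size.
    branches = [Conv2DLayer(incoming, num_filters, filter_size=fs, pad='same')
                for fs in (1, 3, 5)]
    return ConcatLayer(branches, axis=1)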


Follow up on class discussion

Today, we discussed a general rule for setting the architecture of the last layer of a neural network. More specifically, we discussed how to set the output of a network and its loss. Yoshua said that it is a good idea to have the network output a probability distribution over the r.v. Y (by either outputting a discrete distribution or the parameters of a continuous distribution) and to use -log P(Y=y|x) (the negative log-likelihood) as the loss.
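
As a small numeric illustration of what this looks like in the continuous case (assuming the network outputs the mean and log standard deviation of a univariate Gaussian for each x):

import numpy as np

def gaussian_nll(y, mu, log_sigma):
    # -log N(y; mu, sigma^2), the per-example negative log-likelihood.
    sigma = np.exp(log_sigma)
    return 0.5 * np.log(2 * np.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2

print(gaussian_nll(y=1.3, mu=1.0, log_sigma=0.0))  # ~0.964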

The problem I see with this is in the continuous case, since we are ditching the « non-parametricness » of neural networks, which I find very desirable, by (very often) using multivariate normal distributions. I say this because, for one, there is a very small set of multivariate distributions to choose from, and most (not to say all) of those distributions are not very « complicated »; for example, they have only one mode.

My first idea to solve this was to train a NN that outputs a second NN, which takes an input y (where y is in the domain of Y) and outputs P(Y=y|X=x), where x is the input to the first NN. But this seems overly complicated and possibly impossible to train. You would need to test tons of hyperparameters to get the size of the output NN right, and I guess it would need to be fairly large, because you aren’t passing any expert knowledge to the network about what it is supposed to do. One would also need to make sure the output of the second NN is a distribution. One way to do this would be to integrate the output of the second NN, say z, over the domain of Y to get, say, Z, and compute the cost with z/Z. The integration should be pretty straightforward, since we only have one input, one output and simple activation functions (ReLUs are even piecewise linear), so we should be able to get a closed-form expression.
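
A toy version of that normalization trick, with numerical integration standing in for the closed-form expression (the piecewise-linear function below is just a stand-in for the second NN):

import numpy as np

ys = np.linspace(0.0, 10.0, 1001)            # grid over the domain of Y
z = np.maximum(0.0, 5.0 - np.abs(ys - 4.0))  # stand-in for the second NN (ReLU-like)
Z = np.trapz(z, ys)                          # total mass
density = z / Z                              # now integrates to ~1
print(np.trapz(density, ys))                 # ~1.0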

My second idea is actually a cheat on the « parametricity ». There are some unidimensional distributions which can approximate any other distribution (for example, phase-type distributions for positive-valued distributions). The problem with these distributions is that they are often very hard to generalize to the multivariate case. One way to fix this would be to use copulas to model the dependency between the various dimensions, but since copulas work on pairs of dimensions, they would be hard to use.

It turns out that multivariate phase-type distributions exist! I didn’t know about them until now, but as I understand it, they are dense in the set of all positive-valued multivariate distributions. They can have as many parameters as we want, depending on how well we want to approximate the complex distribution P(Y|X=x). The training would be a little bit fancy, but if we chose a distribution which doesn’t have too many parameters, we might manage. I’d be very interested in trying something of the sort at some point. The obvious advantage would be that instead of only predicting a real-valued output, we could calculate any quantity of interest, e.g. confidence intervals, variance, skewness, mode, median, …

I think it is not too far-fetched to say that this neural network would need far fewer outputs than in my first idea, which required an output for every weight of the second NN. This is because we are doing mostly the same thing as in idea 1, except that we restrict the final function (our estimator of P(Y=y|X=x)) to be a distribution of a given type.

In most cases, maybe this extra information isn’t worth it, but I can see it being useful in various businesses. For example, in insurance, we would like to know not only the average amount of losses a client might incur in the next year, but also how likely they are to cost more, what the 99th quantile of their loss distribution is, etc.


Psychology works!

So I must admit, I haven’t updated my blog and my repository as often as I should have. Still, my blog doesn’t reflect all the work I’ve done on the project! So in the next few days, I’m going to try to publish a few posts showing the work I’ve done and what I’m hoping to achieve with the project.

The first entry in this series (i.e. this post) will cover everything I tried before hitting the 80% mark, and maybe even more!

In my last post, I covered how I did my image preprocessing (still without data augmentation). Once that was set up, I went straight for the 80% goal! In hindsight, my first attempt was a bit overkill: I thought I would make certain to get at least 80% accuracy by implementing a known network architecture, namely VGG-16 (http://arxiv.org/abs/1409.1556).

I used Lasagne to get everything in order quickly, and I trained my network from scratch. Unsurprisingly, I got over 80% accuracy after a long training time. I didn’t save the training curves (maybe I will find time to run it again at some point to get them), since I don’t think this model is very interesting for the problem at hand.

Seeing how this model took up about all the memory on my GPU even with a small batch size, I got curious as to how small and simple a model could be and still hit the 80% mark. I took inspiration from Guillaume Alain’s model (https://gyomift6266h15.wordpress.com/2015/02/05/hit-the-80-mark-continued/) to get the following model (I’ll also update my GitHub repo soon):

  • Input is resized + cropped to 128×128 (I could probably even go smaller!)
  • 32 filters convolution (4×4 filter)
  • max pooling (2×2 window, 2×2 stride)
  • 16 filters convolution (4×4 filter)
  • max pooling (2×2 window, 2×2 stride)
  • 16 filters convolution (4×4 filter)
  • max pooling (2×2 window, 2×2 stride)
  • 16 filters convolution (4×4 filter)
  • max pooling (2×2 window, 2×2 stride)
  • 16 filters convolution (4×4 filter)
  • max pooling (2×2 window, 2×2 stride)
  • 16 unit fully connected layer
  • 16 unit fully connected layer
  • 2 softmax unit

All non-linearities were ReLU. I used ADAM as my training algorithm. My batch size was set to 48…

And after as little as five epochs, my train and test errors were below 20%! I still need to look into how to make training/test error plots (a quick sketch follows the console output below), but for now here is my Python console output :

 

# Neural Network with 24722 learnable parameters

## Layer information

  #  name      size
---  --------  ----------
  0  input     3x128x128
  1  conv2d1   32x125x125
  2  maxpool1  32x62x62
  3  conv2d2   16x59x59
  4  maxpool2  16x29x29
  5  conv2d3   16x26x26
  6  maxpool3  16x13x13
  7  conv2d4   16x10x10
  8  maxpool4  16x5x5
  9  dense1    16
 10  dense2    16
 11  output    2

[…]

sum(np.equal(y_test, preds))/len(preds)
Out: 0.8125
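
A quick way to get the training/test error plots mentioned above would be something like this (the error values here are placeholders; in practice I would log one train/test error per epoch):

import matplotlib.pyplot as plt

# Placeholder values, one entry per epoch (e.g. 1 - accuracy).
train_errors = [0.45, 0.30, 0.22, 0.18, 0.15]
test_errors = [0.46, 0.33, 0.26, 0.21, 0.19]

plt.plot(train_errors, label='train error')
plt.plot(test_errors, label='test error')
plt.xlabel('epoch')
plt.ylabel('error')
plt.legend()
plt.savefig('training_curve.png')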

In my next post, I’ll probably cover what I’m looking at now, or what I want to try next. I should also do a post on the problems I encountered and how I fixed them.


Creating a fuel transformer to rescale

The basic « MinimumImageDimensions » transformer in Fuel has some drawbacks. Mostly, it only upscales images to a minimum dimension. Since I wanted to downscale the bigger images at the same time, I modified it to create « MinMaxImageDimensions » (not a great name, but I tried to keep it in line with the previous one). The code is here: https://github.com/etiennelhardy/ift6266-h16/blob/master/image_transform.txt

One of the key things is that the basic « MinimumImageDimensions » transformer does NOT work on a default DataStream! This is because the default DataStream converts the channel values to floating-point numbers, but PIL (the Python Imaging Library) works on arrays of integers from 0 to 255. I worked around that by simply multiplying the floating-point values by 254 and then casting to integers. I’ve done a few tests and everything seems to work. (A rough sketch of the conversion is below.)
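
The conversion itself boils down to something like this sketch (floats_to_pil is a placeholder name, and I am assuming a channels-first array of floats in [0, 1], which is what the default stream gives; the real code lives inside the transformer linked above):

from PIL import Image

def floats_to_pil(example):
    # example is a c x h x w array of floats; PIL wants h x w x c uint8 values,
    # so scale, cast, and move the channel axis last before building the image.
    as_ints = (example * 254).astype('uint8')
    return Image.fromarray(as_ints.transpose(1, 2, 0))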
