Today, we discussed a general rule for choosing the architecture of the last layer of a neural network; more specifically, how to set the output of a network and its loss. Yoshua said that it is a good idea to have the network output a probability distribution over the r.v. Y (by outputting either a discrete distribution or the parameters of a continuous distribution), with the loss being -log P(Y=y|x) (the negative log-likelihood).
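As a concrete (hypothetical) instance of this recipe in the continuous case, suppose the last layer outputs the two parameters of a normal distribution, a mean and a log standard deviation; the loss is then just the negative log of the normal density evaluated at the observed y:

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood -log P(Y=y | x) when the network outputs
    the parameters (mu, log_sigma) of a univariate normal distribution."""
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.log(2.0 * np.pi * sigma2) + (y - mu) ** 2 / (2.0 * sigma2)

# For a fixed sigma, a prediction with mu == y gives a lower loss
# than one with mu far from y, as expected.
loss_good = gaussian_nll(y=1.0, mu=1.0, log_sigma=0.0)
loss_bad = gaussian_nll(y=1.0, mu=3.0, log_sigma=0.0)
```

Parameterizing the standard deviation through its log keeps it positive without any constraint on the raw network output.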
The problem I see with this arises in the continuous case: by (very often) using multivariate normal distributions, we are ditching the « non-parametricness » of neural networks, which I find very desirable. I think so because, for one, there is a very small set of multivariate distributions to choose from, and most (not to say all) of them are not very « complicated »; for example, they have only one mode.
My first idea to solve this was to train a NN to output a second NN that takes an input y (where y is in the domain of Y) and outputs P(Y=y|X=x), where x is the input to the first NN. But this seems overly complicated and possibly impossible to train. You would need to test tons of hyperparameters to get the size of the output NN right, and I suspect it would need to be fairly large because you aren’t passing the network any expert knowledge about what it is supposed to do. One would also need to make sure the output of the second NN is a distribution. One way to do this would be to integrate the output of the second NN, say z, over the domain of Y, to get say Z, and calculate the cost with z/Z. The integration should be pretty straightforward since we only have one input, one output and simple activation functions (ReLUs are even piecewise linear), so we should be able to get a closed-form expression.
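To make the idea concrete, here is a toy sketch (all names hypothetical): the first NN is stood in for by a fixed random map that produces the weights of a small one-hidden-layer ReLU net, and the normalization is done numerically over a grid rather than in the closed form discussed above, just to keep the example short:

```python
import numpy as np

def hyper_net(x):
    """Stand-in for the first NN: maps x to the weights of a tiny second
    NN (one ReLU hidden layer, scalar input y, scalar positive output).
    Here it is just a fixed x-dependent random map, for illustration."""
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(8, 1)) * (1.0 + x)
    b1 = rng.normal(size=8)
    W2 = rng.normal(size=(1, 8))
    b2 = rng.normal(size=1)
    return W1, b1, W2, b2

def second_net(y, params):
    """The output NN: maps y to an unnormalized density value."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, W1 @ np.atleast_1d(y) + b1)  # ReLU hidden layer
    return float(np.exp(W2 @ h + b2))  # exp keeps the output positive

def conditional_density(y, x, grid):
    """Normalized estimate of P(Y=y | X=x): z / Z, with Z a Riemann sum."""
    params = hyper_net(x)
    dy = grid[1] - grid[0]
    Z = sum(second_net(g, params) for g in grid) * dy
    return second_net(y, params) / Z
```

By construction, summing the normalized density over the same grid recovers (approximately) 1, which is exactly the z/Z trick from the paragraph above, only done numerically.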
My second idea is actually a cheat on the « parametricity ». There are some unidimensional distributions which can approximate any other distribution (for example, phase-type distributions for positive-valued distributions). The problem is that these distributions are often very hard to generalize to the multivariate case. One way to fix this would be to use a copula to model the dependency between the various dimensions, but since copulas work on pairs of dimensions, they would be hard to use here.
It turns out that multivariate phase-type distributions exist! I didn’t know about them until now, but as I understand it, they are dense in the set of all positive-valued multivariate distributions. They can have as many parameters as we want, depending on how well we want to approximate the complex distribution P(Y|X=x). The training would be a little bit fancy, but if we choose a distribution which doesn’t have too many parameters, we might manage. I’d be very interested in trying something of the sort at some point. The obvious advantage is that instead of only predicting a real-valued output, we could calculate any quantity of interest, e.g. confidence intervals, variance, skewness, mode, median, …
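To show what evaluating a (univariate) phase-type density involves, here is a small sketch: the density is f(y) = α exp(Ty) t with initial vector α, sub-intensity matrix T and exit vector t = -T1. The Erlang(2, λ) distribution is a phase-type distribution, so we can check the matrix formula against its closed-form density λ²y e^(-λy). The truncated-series matrix exponential below is only adequate for small, well-scaled matrices:

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential via truncated Taylor series (small matrices only)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def phase_type_pdf(y, alpha, T):
    """Phase-type density f(y) = alpha · exp(T y) · t, with t = -T 1."""
    t = -T @ np.ones(T.shape[0])
    return float(alpha @ expm(T * y) @ t)

# Erlang(2, lam) expressed as a phase-type distribution:
lam = 1.5
alpha = np.array([1.0, 0.0])
T = np.array([[-lam, lam],
              [0.0, -lam]])
```

The parameter count here is the size of α and T, so the approximation power grows by adding phases (more rows/columns), which is exactly the « as many parameters as we want » knob.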
I think it is not too far-fetched to say that this neural network would need to output far fewer values than my first idea, which required an output for every weight of the second NN. This is because we are doing mostly the same thing as in idea 1, except that we restrict the final function (our estimator of P(Y=y|X=x)) to be a distribution of a given type.
In most cases, maybe this extra information isn’t worth it, but I can see it being useful in various businesses. For example, in insurance, we would like to know not only the average amount of losses a client might incur in the next year, but also how likely they are to cost more than that, what the 99th quantile of their loss distribution is, etc.
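As a hypothetical illustration of this, suppose the network predicts the rate λ of an exponential loss distribution for a client (a deliberately simple stand-in for a richer family such as phase-type); every quantity mentioned above then follows from the predicted distribution rather than from a single point estimate:

```python
import numpy as np

# Hypothetical network output: rate of an exponential loss distribution.
lam = 0.01  # implies a mean annual loss of 1 / lam = 100

mean_loss = 1.0 / lam                      # expected loss
q99 = -np.log(1.0 - 0.99) / lam            # 99th quantile of Exponential(lam)
prob_over_300 = np.exp(-lam * 300.0)       # P(loss > 300), survival function
```

With a point estimate alone, only `mean_loss` would be available; the tail quantile and exceedance probability come for free once a full distribution is predicted.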