Source: https://www.quora.com/What-is-the-dying-ReLU-problem-in-neural-networks

Here is one scenario:

Suppose there is a neural network with some distribution over its inputs X. Let's look at a particular ReLU unit R. For any fixed set of parameters, the distribution over X implies a distribution over the inputs to R. Suppose for ease of visualization that R's inputs are distributed as a low-variance Gaussian centered at +0.1.

Under this scenario:

Most inputs to R are positive, thus

Most inputs will cause the ReLU gate to be open, thus

Most inputs will cause gradients to flow backwards through R, thus

R's inputs are usually updated through SGD backprop.

Now suppose during a particular backprop that there is a large magnitude gradient passed backwards to R. Since R was open, it will pass this large gradient backwards to its inputs. This causes a relatively large change in the function which computes R's input. This implies that the distribution over R's inputs has changed -- let's say the inputs to R are now distributed as a low-variance Gaussian centered at -0.1.

Now we have that:

Most inputs to R are negative, thus

Most inputs will cause the ReLU gate to be closed, thus

Most inputs will cause gradients to fail to flow backwards through R, thus

R's inputs are usually not updated through SGD backprop.

What happened? A relatively small change in R's input distribution (-0.2 on average) has led to a qualitative difference in R's behavior. We have crossed over the zero boundary, and R is now almost always closed. And the problem is that a closed ReLU cannot update its input parameters, so a dead (dead=always closed) ReLU stays dead.
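To make the before/after concrete, here is a minimal NumPy sketch of the scenario above. The ±0.1 means and the 0.05 standard deviation are just illustrative numbers chosen to match the example; the point is how sharply the fraction of "gate open" inputs collapses when the mean crosses zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sigma = 0.05  # low-variance Gaussian, as in the example

# Pre-activation inputs to R before and after the large update.
z_before = rng.normal(loc=+0.1, scale=sigma, size=n)
z_after  = rng.normal(loc=-0.1, scale=sigma, size=n)

def open_fraction(z):
    """Fraction of inputs for which the ReLU gate is open (z > 0),
    i.e. the fraction of examples that let gradients flow back through R."""
    return np.mean(z > 0)

print(f"gate open before shift: {open_fraction(z_before):.3f}")  # ~0.98
print(f"gate open after  shift: {open_fraction(z_after):.3f}")   # ~0.02
```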

Mathematically, this is because the ReLU computes the function

r(x)=max(x,0)

whose gradient is:

∇_x r(x) = 𝟙{x > 0}

So the ReLU will close the gate during backprop if and only if it closed the gate during forward prop. A dead ReLU is likely to stay dead.
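In code, the gate property looks like this. This is a bare-bones NumPy forward/backward pair written for illustration, not any particular framework's implementation; the gradient mask saved in the forward pass is exactly the mask applied in the backward pass:

```python
import numpy as np

def relu_forward(x):
    # r(x) = max(x, 0); also return the gate mask reused in backprop.
    gate = x > 0
    return np.where(gate, x, 0.0), gate

def relu_backward(upstream_grad, gate):
    # grad_x r(x) = 1{x > 0}: gradients flow back only where the gate was
    # open during the forward pass -- a closed unit receives no update signal.
    return upstream_grad * gate

x = np.array([-0.15, -0.05, 0.05, 0.15])
y, gate = relu_forward(x)
dx = relu_backward(np.ones_like(x), gate)
print(y)   # [0.   0.   0.05 0.15]
print(dx)  # [0. 0. 1. 1.]
```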

(Edit: As Liu Hu and others have noted, there is still a chance to revive the ReLU. Recall that many of the “upstream” parameters affecting R’s input distribution are still being updated via other paths in the graph. For example, R’s “siblings” include ReLUs that are open and still passing gradients backwards. These updates to R’s upstream parameters may move R’s input distribution back to having nontrivial support in the positive regime. The details are a bit intricate: in an affine layer, for example, the weight row that directly feeds R receives no gradient until R opens, so a fully dead ReLU can only be revived through updates to earlier layers. In particular, you cannot revive a dead ReLU at the first hidden layer if the linear transformation is a typical affine map with no parameter sharing, since its inputs are the raw data and there is nothing further upstream to update. In any case, the matter deserves some empirical study.)
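Here is a toy NumPy illustration of that last point, with hand-derived gradients for a two-layer net. The layer sizes, the loss (a plain sum of the outputs), and the trick of forcing one second-layer unit dead are all made up for the demonstration. The weight row feeding the dead unit gets exactly zero gradient, while the previous layer still receives gradients through the dead unit's open siblings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny 2-layer toy: x -> affine(W1) -> ReLU -> affine(W2) -> ReLU -> sum
d_in, d_h1, d_h2 = 4, 5, 3
W1 = rng.normal(size=(d_h1, d_in))
W2 = rng.normal(size=(d_h2, d_h1))
W2[0] = -5.0 * np.abs(W2[0])    # force unit 0 of layer 2 to be dead:
                                # its pre-activation is always <= 0 because
                                # layer-1 activations are >= 0 and these weights are negative

X = rng.normal(size=(64, d_in))  # a batch of inputs

# Forward pass
Z1 = X @ W1.T
H1 = np.maximum(Z1, 0.0)
Z2 = H1 @ W2.T
H2 = np.maximum(Z2, 0.0)
loss = H2.sum()

# Backward pass (loss = sum of H2, so dL/dH2 = 1 everywhere)
dZ2 = (Z2 > 0).astype(float)     # ReLU gate of layer 2
dW2 = dZ2.T @ H1
dH1 = dZ2 @ W2
dZ1 = dH1 * (Z1 > 0)
dW1 = dZ1.T @ X

print("unit 0 dead on this batch:", bool(np.all(Z2[:, 0] <= 0)))  # True
print("gradient to W2 row 0:", np.abs(dW2[0]).max())              # 0.0
print("gradient still reaches W1:", bool(np.abs(dW1).max() > 0))  # True
```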

I'm not sure how often ReLU dying happens in practice, but apparently often enough to be something to watch out for. Hopefully you can see why a large learning rate could be a culprit here. The larger the average magnitude of updates from SGD steps, the greater the risk that we might push the entire distribution of R's inputs over to the negative regime.
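A back-of-the-envelope sketch of the learning-rate effect, using arbitrary illustrative numbers (the gradient magnitude and the learning rates are made up):

```python
import numpy as np

# A unit whose pre-activation mean sits at +0.1, as in the scenario above,
# and a single large-magnitude gradient arriving at its bias.
mean_pre_activation = 0.1
grad_b = 2.0

for lr in (0.001, 0.01, 0.1):
    new_mean = mean_pre_activation - lr * grad_b  # SGD shifts the mean by -lr * grad
    print(f"lr={lr:<5}: mean pre-activation {mean_pre_activation:+.3f} -> {new_mean:+.3f}")

# With a small input variance (sigma ~ 0.05), a mean of -0.1 means the gate is
# closed on roughly 98% of examples -- the unit has effectively died in one step.
```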

Finally, unlike in my example, the death can happen in stages: one SGD step pushes part of the distribution out of the positive regime, and subsequent steps push out the rest. More generally, you can think of R's input distribution as following a kind of random walk. If a relatively large step is taken, you might wind up with a dead ReLU.

A way to avoid this would be to use nonlinearities that never have a zero-gradient regime (e.g. "Leaky ReLU", which keeps a small nonzero slope for negative inputs), but I'm not totally sure if we want/need this. One could also consider trying to "recover" dead ReLUs by detecting them and randomly reinitializing their input parameters -- this would slow down learning but might encourage more efficient use of parameters.
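Both ideas are easy to sketch in NumPy. The function names, the one-batch "dead" criterion, and the reinitialization scale below are arbitrary illustrative choices, not a recommendation:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU keeps a small slope alpha for x <= 0, so its gradient
    # (1 for x > 0, alpha otherwise) never vanishes entirely.
    return np.where(x > 0, x, alpha * x)

def reinit_dead_rows(W, b, Z, rng, scale=0.01):
    """Detect units that were closed (pre-activation <= 0) on every example
    in the batch of pre-activations Z = X @ W.T + b, and randomly reinitialize
    their incoming weights and biases. A crude version of the 'detect and
    reinitialize' idea: one dead batch is a simplistic criterion."""
    dead = np.all(Z <= 0, axis=0)   # shape: (n_units,)
    W[dead] = rng.normal(scale=scale, size=W[dead].shape)
    b[dead] = 0.0
    return dead
```

In a real training loop you would presumably run the detector only occasionally and over more than one batch, since a unit that happens to be closed on a single batch is not necessarily dead.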
