
I'm not an NN expert, but based on what I have found, the point of using ReLU instead of other activation functions is to deal with what is called the "vanishing gradient" problem.

Basically (IIRC), during backprop the error signal gets ever smaller the further back through the layers you go, ultimately getting "lost in the noise" and making learning in the earlier layers difficult or impossible.
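
As a back-of-the-envelope illustration of how fast that happens (my own numbers, nothing rigorous): if each layer's backprop step multiplies the gradient by at most the sigmoid's peak derivative of 0.25, the signal reaching the earliest layers shrinks exponentially with depth.

    # Rough sketch (my own): the sigmoid derivative never exceeds 0.25, so a
    # chain of sigmoid layers scales the backpropagated gradient by at most
    # 0.25 per layer.
    for depth in (5, 10, 20, 40):
        max_factor = 0.25 ** depth
        print(f"{depth:>2} sigmoid layers: gradient scaled by at most {max_factor:.1e}")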

I'm not saying ReLU is the only option here, or the only activation function that provides a "fix" for the issue; I'm sure there are other ways to deal with vanishing gradients that I don't know about.

I also lack the mathematical knowledge as to why ReLU helps here, but I suspect it has something to do with the lack of "asymptotic structure" approaching the extremes (I don't know what the proper term would be). Or maybe it allows for some form of "forgetting", by preventing the multiplication of many very small numbers (such values just go to zero eventually)?
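
If the term I'm grasping for is "saturation", the intuition would be that the sigmoid's derivative heads toward zero at the extremes, while ReLU's derivative is exactly 1 for any positive input, so the products chained together during backprop don't keep shrinking. A minimal NumPy comparison (again my own sketch, nothing authoritative):

    import numpy as np

    # My own illustrative sketch: how much each activation scales a gradient
    # during backprop, at a few sample inputs.
    xs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

    sig = 1.0 / (1.0 + np.exp(-xs))
    sig_grad = sig * (1.0 - sig)            # near zero at the extremes (saturation)
    relu_grad = (xs > 0).astype(float)      # exactly 1 wherever the input is positive

    print("x:          ", xs)
    print("sigmoid'(x):", np.round(sig_grad, 4))   # [0.0025 0.105  0.25   0.105  0.0025]
    print("relu'(x):   ", relu_grad)               # [0. 0. 0. 1. 1.]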

Maybe someone else here with the knowledge can explain it better, and we can both learn...?




ReLU is useful, but the question as asked was more about other (also essential) functions used at other steps in RNNs -- in particular, sigmoid "squash" functions, as well as MaxPooling.
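
For anyone following along, here's a rough NumPy sketch of what those two operations do (my own illustration, not tied to any particular architecture upthread): the sigmoid squashes any real value into (0, 1), which is what gating values need, and max pooling keeps only the largest activation in each window.

    import numpy as np

    # My own illustrative sketch, not anyone's reference implementation.
    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    squashed = 1.0 / (1.0 + np.exp(-x))     # every value lands in (0, 1)

    feature_map = np.array([[1.0, 3.0, 2.0, 0.0],
                            [5.0, 4.0, 1.0, 2.0]])
    # Max pooling with a window of 2 along the last axis: keep only the
    # larger value from each non-overlapping pair.
    pooled = feature_map.reshape(2, 2, 2).max(axis=-1)

    print("sigmoid:", np.round(squashed, 3))    # [0.047 0.378 0.5   0.622 0.953]
    print("pooled:")
    print(pooled)                               # [[3. 2.] [5. 2.]]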



