"I initially got backprop wrong both times, comparison with numerical differentiation was critical! It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made."
That is the bane of writing probabilistic code. Errors show up not as clear-cut wrong values or crashes but as subtle biases. You are always wondering, even when it is kinda working: is it REALLY working, or did I miss a crucial variable initialization somewhere?
There might be something deeper there. I am thinking of the line of research, associated with Bengio, on biologically plausible backprop: it turns out you can send the error signal back through fixed random weights and the network still learns. Which matters because it's not very plausible that the brain computes exact derivatives and routes them to each neuron individually, but it could more easily send a coarser error signal backwards.
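A minimal sketch of that idea, usually called feedback alignment, where the backward pass pushes the error through a fixed random matrix instead of the transposed forward weights (the toy data, network sizes, and hyperparameters here are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data: a linear target the network should be able to fit
    X = rng.standard_normal((256, 10))
    Y = X @ rng.standard_normal((10, 1))

    # Two-layer tanh network
    W1 = rng.standard_normal((10, 32)) * 0.1
    W2 = rng.standard_normal((32, 1)) * 0.1
    B  = rng.standard_normal((32, 1)) * 0.1  # fixed random feedback weights

    lr = 0.01
    for step in range(2000):
        # Forward pass
        h = np.tanh(X @ W1)
        y_hat = h @ W2
        err = y_hat - Y  # gradient of squared error w.r.t. y_hat (up to a constant)

        # Backward pass: the error goes back through the fixed random B, not W2.T
        dW2 = h.T @ err
        dh  = err @ B.T                      # feedback-alignment step
        dW1 = X.T @ (dh * (1 - h**2))        # tanh'(a) = 1 - tanh(a)**2

        W1 -= lr * dW1 / len(X)
        W2 -= lr * dW2 / len(X)

    print("final MSE:", float(np.mean(err**2)))  # should still drop despite the random feedback

The "as long as the sign is right most of the time" observation from the quote is roughly why this can work: the forward weights tend to align with the random feedback matrix over training, so the pseudo-gradient points in a useful direction often enough.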
It's actually not very different from graphics programming, where a simple rounding error can cause all kinds of trouble, from very small (a surface or ray not reflecting exactly where it should) to very big (completely messing up your entire render).
Both activities very much resemble chaotic systems, and both are very challenging to debug.