I don't quite understand how he gets a 10,000x speedup from a 100x transistor count decrease. Does die area increase with the square of transistor count?
What he's doing is representing numbers by their logarithms, with limited precision. A floating-point multiplier/divider then turns into a simple adder, which is much smaller and faster. Square roots and squaring turn into bit shifts. They have some clever method for doing addition/subtraction efficiently. And since they can fit all of this in a small area with short critical paths, they can clock it very, very fast and put a lot of these units on a chip.
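Here's a rough sketch in Python of the idea, not their actual design: every encoding detail (base-2 logs, 16 fractional bits, the table-free add) is my own assumption, just to show why multiply/divide collapse to integer add/subtract and square/sqrt to shifts, while add/subtract needs the extra log(1 + 2^d) step.

```python
import math

# Assumed fixed-point log encoding: store log2(x) scaled by 2**FRAC_BITS.
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS

def encode(x):
    """Encode a positive real as a fixed-point base-2 logarithm."""
    return round(math.log2(x) * SCALE)

def decode(l):
    """Decode back to a real number (for checking only)."""
    return 2.0 ** (l / SCALE)

def log_mul(a, b):
    # Multiplication is just integer addition of the logs.
    return a + b

def log_div(a, b):
    # Division is integer subtraction of the logs.
    return a - b

def log_square(a):
    # Squaring doubles the log: a one-bit left shift.
    return a << 1

def log_sqrt(a):
    # Square root halves the log: a one-bit right shift.
    return a >> 1

def log_add(a, b):
    # Addition is the hard case: log(x + y) = log(x) + log(1 + y/x).
    # Hardware would approximate log2(1 + 2**d) with a small lookup table
    # or piecewise function; here we just compute it directly.
    hi, lo = max(a, b), min(a, b)
    d = (lo - hi) / SCALE               # d <= 0
    return hi + round(math.log2(1.0 + 2.0 ** d) * SCALE)

# Quick check
x, y = encode(3.0), encode(7.0)
print(decode(log_mul(x, y)))      # ~21.0
print(decode(log_sqrt(encode(2.0))))  # ~1.414
print(decode(log_add(x, y)))      # ~10.0
```

The point is that the per-operation hardware is tiny: the multiply path is nothing but an integer adder, so the expensive part is whatever approximation you pick for the add/subtract table.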
Well, he says it's ~100x faster than a GPU, and GPUs are ~100x faster than CPUs (in the applications they're suited to), so the two factors multiply: the 10,000x figure is the speedup relative to a CPU, not to a GPU.