I don't see where it addresses the problem of generating a large enough number of shapes that are hard enough to recognize with computer vision.
I'm afraid that a nice gesture recognition algorithm is not enough to defend against bots programmed to recognize known solutions and replay (slightly randomized) predefined answers.
You need a lot more shapes… but there aren't many shape/size/position combinations that are easy for humans.
You need complicated images, because a plain shape on a plain background (and, in general, any shape that is separable from its background on a histogram) is easy to trace automatically.
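
To illustrate the histogram point, here is a minimal sketch of how little code a bot would need. It assumes a synthetic grayscale image (a dark square on a light, noisy background) and uses Otsu's thresholding, which is my choice for the illustration and not something taken from the project. The names `make_image`, `otsu_threshold` and `segment` are hypothetical helpers.

```python
import random


def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def make_image(size=64, seed=0):
    """Synthetic test image (an assumption): dark square on a light background, with noise."""
    rng = random.Random(seed)
    image, truth = [], []
    for y in range(size):
        row, truth_row = [], []
        for x in range(size):
            inside = 16 <= x < 48 and 16 <= y < 48
            base = 60 if inside else 200
            row.append(max(0, min(255, base + rng.randint(-20, 20))))
            truth_row.append(inside)
        image.append(row)
        truth.append(truth_row)
    return image, truth


def segment(image):
    """Mark every pixel at or below the Otsu threshold as part of the shape."""
    t = otsu_threshold([p for row in image for p in row])
    return [[p <= t for p in row] for row in image]


image, truth = make_image()
mask = segment(image)
print("shape recovered exactly:", mask == truth)
```

Because the two intensity ranges in this synthetic image do not overlap, the thresholded mask should match the ground-truth shape pixel for pixel. Real challenge images would need textures and overlapping intensity distributions before this kind of attack stops working, and that is exactly what makes them harder for humans too.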