
> I don't know how to estimate what values of n would be relevant for today's applications, but n=2^22≈4M doesn't seem like such a big number for images or video (if signal length roughly corresponds to number of pixels). Any thoughts?

Transform codecs for video typically don't transform an entire frame at a time, but rather blocks of the frame. There are three reasons for this. The first is efficiency: most of these codecs date back to a time when large transforms (FFT, DCT, etc.) were very expensive. Possibly the most important is that local artifacts are less noticeable than global artifacts in video, and using blocks is a nice way to bound them. Finally, the point of using a transform is to extract some 'structure' from the data that can be used to encode it more efficiently, and small blocks make this easier to do in a perceptually acceptable or transparent way. A rough sketch of the block-based idea follows.
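To make that concrete, here's a minimal Python sketch of block-wise transform coding, assuming NumPy and SciPy are available. The 8x8 block size and the uniform quantization step are illustrative placeholders, not parameters of any particular codec.

    import numpy as np
    from scipy.fft import dctn, idctn

    BLOCK = 8  # small blocks keep the transform cheap and the artifacts local

    def encode_decode(frame, q=16):
        """Transform, coarsely quantize, and reconstruct a frame block by block.
        Assumes frame dimensions are multiples of BLOCK."""
        h, w = frame.shape
        out = np.empty((h, w), dtype=np.float64)
        for y in range(0, h, BLOCK):
            for x in range(0, w, BLOCK):
                block = frame[y:y+BLOCK, x:x+BLOCK].astype(np.float64)
                coeffs = dctn(block, norm='ortho')      # small 2-D DCT per block
                coeffs = np.round(coeffs / q) * q       # crude uniform quantizer
                out[y:y+BLOCK, x:x+BLOCK] = idctn(coeffs, norm='ortho')
        return out

Each block's coefficients are quantized independently, so any quantization error stays confined to that 8x8 region of the frame.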

There are absolutely applications for large transforms, but video encoding typically isn't one. Incidentally, many modern approaches are based on other transforms, like the discrete wavelet transform, or approximations to transforms like the DCT.
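As one example of such an approximation, the 4x4 integer transform core used in H.264/AVC approximates a small DCT using only small-integer arithmetic; the scaling that makes it match the true DCT is folded into quantization in the actual codec. A rough sketch (the matrix is the standard H.264 core; everything else here is just illustration):

    import numpy as np

    # 4x4 integer transform core from H.264/AVC, an approximation of a 4x4 DCT
    C = np.array([[1,  1,  1,  1],
                  [2,  1, -1, -2],
                  [1, -1, -1,  1],
                  [1, -2,  2, -1]])

    def forward_4x4(block):
        """Separable core transform Y = C X C^T; scaling is left to the
        quantization stage, as in the real codec."""
        return C @ block @ C.T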



