Redshift looks to be an order of magnitude faster than Impala or Shark in all the tests. Does this mean that once Redshift supports user-defined functions, no competing solution will be any match (unless you want to avoid the cloud)?
The benchmark above tests Impala on GZIP-compressed SequenceFiles against Redshift, which is not a fair comparison: SequenceFiles are a row-oriented format, while Redshift uses columnar storage.
In the "What's next?" section, they say they want to re-do the Impala tests using Parquet, which is a columnar format based on the Dremel whitepaper (http://parquet.io/).
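To make the row-versus-columnar point concrete, here is a minimal sketch in plain Python. It is not how Impala, Redshift, or Parquet are actually implemented; the toy table, its column names, and the `bytes_scanned_*` helpers are all made up for illustration. It only counts how many bytes a sum over one column has to touch under each layout.

```python
import struct

# Hypothetical toy table: (page_rank, avg_duration, url) per row.
ROWS = [(i % 100, i % 7, f"http://example.com/{i}") for i in range(1000)]

def encode_row(row):
    """Serialize one whole row: two 4-byte ints plus a length-prefixed URL."""
    url = row[2].encode("utf-8")
    return struct.pack("<iiH", row[0], row[1], len(url)) + url

def bytes_scanned_row_layout(rows):
    """Row layout: summing page_rank still has to read every full row."""
    total, scanned = 0, 0
    for blob in (encode_row(r) for r in rows):
        scanned += len(blob)
        (page_rank,) = struct.unpack_from("<i", blob, 0)
        total += page_rank
    return total, scanned

def bytes_scanned_column_layout(rows):
    """Column layout: page_rank values sit together, so only they are read."""
    column = struct.pack(f"<{len(rows)}i", *(r[0] for r in rows))
    values = struct.unpack(f"<{len(rows)}i", column)
    return sum(values), len(column)

row_sum, row_bytes = bytes_scanned_row_layout(ROWS)
col_sum, col_bytes = bytes_scanned_column_layout(ROWS)
print(row_sum == col_sum, row_bytes, col_bytes)
```

Both layouts give the same sum, but on this toy data the column layout reads only the 4 bytes per row that the query needs, while the row layout drags the URL and the other integer along with it. That is the kind of gap a columnar format like Parquet should narrow, though real engines add compression, encoding, and parallelism that this sketch ignores.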
Ah, that makes sense! Looking at their results and how Redshift was so much faster in every scenario, it looked like something was amiss. Is Parquet Cloudera-only like Impala, or is it available with vanilla Hadoop?
Impala isn't technically Cloudera-only. It's open source (https://github.com/cloudera/impala), and other people have gotten it to run on their own Hadoop distributions, but since Cloudera develops it, it is built primarily to run on CDH (Cloudera's Hadoop distribution).
Parquet started as a joint effort between Cloudera and Twitter, and it is now being developed by many other companies. You can use it with Hive, Pig, MapReduce, Cascading, and Crunch, and I think Apache Drill's first milestone has adopted it as a columnar format as well. Parquet also lets you use your Avro or Thrift schema (and soon Protocol Buffers) to write Parquet data.
Note that the suggested benchmark (https://amplab.cs.berkeley.edu/benchmark/) is a slightly modified version of the Hive Benchmark. Both consist of just 3-4 tables in total and 4 very basic queries. I recommend looking at something more realistic, such as TPC-DS or TPC-H.
"Here from HackerNews? This was originally posted several months ago. Check back in two weeks for an updated benchmark including newer versions of Hive, Impala, and Shark."
https://amplab.cs.berkeley.edu/benchmark/