Although a little out of date, there is a website dedicated to this: https://amp...

dude_abides · on Nov 7, 2013

This is awesome, thanks for sharing!

Redshift looks to be an order of magnitude faster than Impala or Shark in all the tests. Does this mean that once Redshift supports user-defined functions, no competing solution will be any match (unless you want to avoid the cloud)?

monstrado · on Nov 7, 2013

The benchmark above tests Impala on GZIP-compressed SequenceFiles (a row-oriented format) against Redshift, which stores data in columns. That is not a fair comparison.

In the "What's next?" section, they say they want to redo the Impala tests using Parquet (http://parquet.io/), a columnar format based on the Dremel paper.
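
To make the row-versus-column point concrete, here is a toy sketch in Python. It is not Parquet or SequenceFile code; the in-memory layouts and the field names (url, rank, duration) are made up for illustration. It only counts how many values an aggregate over a single field has to read under each layout.

```python
# Toy illustration only: hypothetical in-memory layouts, not Parquet or SequenceFile.
rows = [
    {"url": "a.com", "rank": 10, "duration": 3},
    {"url": "b.com", "rank": 25, "duration": 7},
    {"url": "c.com", "rank": 40, "duration": 1},
]


def to_columns(records):
    """Pivot row records into one list per field (a column-oriented layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(records, field):
    """Sum one field; every field of every record is read along the way."""
    values_read = 0
    total = 0
    for record in records:
        values_read += len(record)  # the whole record has to be decoded
        total += record[field]
    return total, values_read


def sum_column_layout(columns, field):
    """Sum one field by reading only that field's column."""
    column = columns[field]
    return sum(column), len(column)


columns = to_columns(rows)
row_total, row_reads = sum_row_layout(rows, "rank")
col_total, col_reads = sum_column_layout(columns, "rank")
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts return the same sum, but the row layout touches every field of every record while the column layout touches only the one column. Real formats add compression and encoding on top of this, so treat it as intuition rather than a performance model.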

dude_abides · on Nov 7, 2013

Ah, that makes sense! Redshift was so much faster in every scenario that it looked like something was amiss. Is Parquet Cloudera-only like Impala, or is it available with vanilla Hadoop?

monstrado · on Nov 7, 2013

Impala isn't technically Cloudera-only. It's open source (https://github.com/cloudera/impala), and other people have gotten it to run on their own Hadoop distributions, but since Cloudera develops it, it was built to run on the CDH platform (Cloudera's Hadoop distribution).

Parquet started as a joint effort between Cloudera and Twitter, and it's now being developed by many other companies. You can use it with Hive, Pig, MapReduce, Cascading, and Crunch, and I think Apache Drill's first milestone has adopted it as a columnar format as well. Parquet also lets you use your Avro or Thrift schema (soon Protocol Buffers) to write Parquet data.

It's a separate project in the ecosystem and has its own roadmap (https://github.com/Parquet/parquet-mr).

justinerickson · on Nov 8, 2013

Note that the suggested benchmark (https://amplab.cs.berkeley.edu/benchmark/) is a slightly modified version of the Hive Benchmark. Both use just 3-4 tables in total and 4 very basic queries. I recommend looking at something more realistic (e.g. TPC-DS or TPC-H).

flyovercountry · on Nov 7, 2013

The file format used for the Hadoop tools is SequenceFile. For most queries this is the second-slowest format, after plain text.

RCFile or Parquet would make for a more interesting benchmark.

pkj · on Nov 7, 2013

From the website:

"Here from HackerNews? This was originally posted several months ago. Check back in two weeks for an updated benchmark including newer versions of Hive, Impala, and Shark."