
Ah, that makes sense! Looking at their results and how Redshift was so much faster in every scenario, it looked like something was amiss. Is Parquet Cloudera-only like Impala, or is it available with vanilla Hadoop?



Impala isn't technically Cloudera-only: it's open source (https://github.com/cloudera/impala), and other people have gotten it to run on their own Hadoop distributions. But since Cloudera develops it, it was built to run on CDH (Cloudera's Hadoop distribution).

Parquet started as a joint effort between Cloudera and Twitter, and many other companies are now developing it too. You can use it with Hive, Pig, MapReduce, Cascading, and Crunch, and I think Apache Drill's first milestone has adopted it as a columnar format as well. Parquet also lets you use your existing Avro or Thrift schema (Protobuf support is coming soon) to write Parquet data.

It's a separate project in the ecosystem and has its own roadmap (https://github.com/Parquet/parquet-mr).



