Not all data warehouses are designed for OLAP. As the performance of these systems increases, it is possible to satisfy OLAP use cases with some warehouses such as BQ, but they are still not ideal once your number of concurrent queries scales while your response-time requirements stay constant. The BQ team certainly doesn't consider their product an OLAP solution, and neither does the Redshift team. AWS is actually working on its own OLAP offering, Timestream; that is how they pitch it, as their OLAP solution.
BigQuery's "petabyte aggregations in seconds" is a false claim. I just launched a query to count the number of records in a 1.4 PB dataset, and it took over a minute just to schedule the query, even with 1000+ compute units available. I have been able to run a similar query in seconds on top of Druid; that cluster had a run rate of about $25k/month at the time, while I believe we pay $40k+/month just on our BQ deal. Maybe under ideal conditions BigQuery can run such a query in seconds, but I have never seen it do so.
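To be concrete, the query was nothing fancy, basically a straight count over the table; something like this (the table name here is made up, but the shape is right):

    SELECT COUNT(*)
    FROM `my-project.analytics.events`;  -- ~1.4 PB table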
I have experience with Snowflake; it is great if you are on AWS, but I wouldn't choose it over BigQuery. I feel that Snowflake's real strength is its data-sharing protocol, which lets us tap into partners' datasets without major data transfer costs. We use it for that reason only.
You're using a strange definition of OLAP here; it's really not tied to performance or concurrency. It only means you can send an ad-hoc query and get results back directly in response, rather than waiting on pre-computed or batched processing happening somewhere else.
Redshift and BigQuery are both OLAP products, and their documentation and whitepapers say as much. Their performance profiles are very different, but that's down to architecture. Redshift is an always-running cluster of EC2 instances running a forked, scale-out Postgres. BigQuery is an entirely different engine based on Dremel, which parallelizes as much as possible and leans on Google's proprietary datacenters, with petabits of internal bandwidth, to support it. AWS Timestream is an entirely different product aimed specifically at time-series data, and basically a separate category from either of these.
The only reason Druid is faster is that it pre-aggregates and indexes data. You can aggregate tables yourself in Redshift and use the right distribution and sort keys to get most of the same performance. BigQuery just launched materialized views and is working on scheduling improvements and in-memory caching as well. Snowflake now runs on Azure and GCP and can give sub-second responses when the warehouse is running and the data is cached or the tables are smaller.
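For reference, a BigQuery materialized view is a one-statement sketch like this (table and column names are just placeholders):

    CREATE MATERIALIZED VIEW `my-project.analytics.daily_event_counts` AS
    SELECT event_type, event_date, COUNT(*) AS events
    FROM `my-project.analytics.events`
    GROUP BY event_type, event_date;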
I wasn't able to achieve the same performance with Redshift aggregates; I tried that first, before deciding to migrate from Redshift to Druid back in 2014. We deal with dozens of dimensions per event, and no combination of distribution keys in Redshift was able to give us the same performance over arbitrary scans + aggregations.
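What we tried looked roughly like this, a pre-aggregated CTAS with different dist/sort key combinations (the column names here are hypothetical; our real tables had dozens of dimensions):

    CREATE TABLE events_agg
      DISTKEY (account_id)
      SORTKEY (event_date)
    AS
    SELECT account_id, event_date, country, device, COUNT(*) AS events
    FROM events
    GROUP BY account_id, event_date, country, device;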
Druid is not fast only because it pre-aggregates; its memory structure is also designed for scans over hundreds of dimensions.
Materializing views in BigQuery is just one DAG task. If you already have something like Airflow in your stack, I don't see how the feature is worth mentioning. We are talking about denormalized, time-series data here.
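The whole thing is one scheduled statement; something like this (names are placeholders), run nightly from a single Airflow task, covers it:

    -- executed by one scheduled DAG task
    CREATE OR REPLACE TABLE `my-project.analytics.events_rollup` AS
    SELECT event_type, event_date, COUNT(*) AS events
    FROM `my-project.analytics.events`
    GROUP BY event_type, event_date;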
I am speaking from experience with each one of these products. Perhaps I did it all wrong, but we certainly achieved the objectives we were after.