You are right about ClickHouse, but other data warehouses are not optimized for the same use case as Druid and ClickHouse: OLAP.

For example, Redshift and BigQuery cannot be put behind a user-facing backend. BigQuery has a default limit of 50 concurrent queries; if that's your user limit, perfect. Redshift takes seconds for queries that Druid and ClickHouse can answer in milliseconds.




All data warehouses are designed for OLAP; that's their purpose. OLAP doesn't require low latency, though.

Redshift is an always-running cluster of a scale-out, distributed Postgres fork maintained by AWS, so it can and does return results in milliseconds, much like ClickHouse, although it is still not as advanced in its performance techniques.

BigQuery is a fully managed service that uses a far greater scale-out architecture designed for throughput (petabytes in seconds) rather than latency, although with real-time streaming, BI Engine (an in-memory cache), and materialized views you can get pretty close today.
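
To illustrate that last point, a materialized view is a single DDL statement through the Python client. The project, dataset, and column names below are made up:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical project/dataset/table names, for illustration only.
    ddl = """
    CREATE MATERIALIZED VIEW `my-project.analytics.events_by_hour` AS
    SELECT TIMESTAMP_TRUNC(event_ts, HOUR) AS hour,
           country,
           COUNT(*) AS events
    FROM `my-project.analytics.events`
    GROUP BY hour, country
    """

    # BigQuery maintains the view incrementally and can transparently
    # rewrite matching aggregate queries to read from it.
    client.query(ddl).result()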

Snowflake is another option; it runs on top of the major clouds, using compute instances over object storage, and also has low latency when your warehouse cluster is running.


Not all data warehouses are designed for OLAP. As the performance of these systems increases, it is possible to satisfy OLAP use cases with some warehouses such as BQ, but they are still not ideal as your number of concurrent queries scales while your response-time requirements remain constant. The BQ team certainly doesn't consider their product an OLAP solution, and neither does the Redshift team. AWS is actually working on its own OLAP solution, Timestream; that is exactly how they pitch it.

BigQuery's "petabyte aggregations in seconds" is a false claim. I just launched a query to count the records of a 1.4 PB dataset, and it took over a minute just to schedule the query, even with 1000+ compute units available. I have been able to perform a similar query in seconds on top of Druid; that cluster had a run rate of $25k/month at the time, while I believe we pay $40k+/month just on our BQ deal. Maybe under ideal conditions BigQuery can run such a query, but I have never seen it do so.
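
For what it's worth, the scheduling vs. execution split is visible in the job statistics exposed by the Python client. The table name here is a stand-in:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table name, for illustration only.
    job = client.query(
        "SELECT COUNT(*) AS n FROM `my-project.logs.events_all`"
    )
    rows = list(job.result())  # blocks until the query finishes

    # job.created is when the job was submitted and job.started is when
    # it actually began executing, so the difference is scheduling delay.
    queued = (job.started - job.created).total_seconds()
    ran = (job.ended - job.started).total_seconds()
    print(f"rows={rows[0].n} queued={queued:.1f}s ran={ran:.1f}s")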

I have experience with Snowflake; it is great if you are in AWS, but I wouldn't choose it over BigQuery. I feel Snowflake's strength is its data-sharing protocol, which allows us to tap into partners' datasets without major data transfer costs. We use it for that reason only.


You're using a strange definition of OLAP here; it's really not tied to performance or concurrency. It only means you can send an ad-hoc query and get results directly in response, rather than waiting on pre-computed or batched processing happening somewhere else.

Redshift and BigQuery are both OLAP products, and their documentation and whitepapers state the same. The performance profiles of the two are very different, but that's due to architecture. Redshift is an always-running cluster of EC2 instances running AWS's forked, scale-out Postgres. BigQuery is an entirely different engine, based on Dremel, which focuses on maximal parallelization, backed by Google's proprietary datacenters with petabits of internal bandwidth. AWS Timestream is an entirely different product, specific to time-series data.

The only reason Druid is faster is that it pre-aggregates and indexes data. You can aggregate tables yourself in Redshift and choose the right distribution and sort keys to get most of the same performance (see the sketch below). BigQuery just launched materialized views and is working on scheduling improvements and in-memory caching as well. Snowflake now runs on Azure and GCP and can do sub-second responses when the warehouse is running and the data is cached or the tables are smaller.
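
A minimal sketch of that Redshift approach, assuming a psycopg2 connection and made-up table/column names:

    import psycopg2

    # Connection details are placeholders.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )

    # Pre-aggregate the raw events and pick keys to match the query
    # pattern: distribute on the filter column, sort on time.
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE events_hourly
            DISTKEY (customer_id)
            SORTKEY (event_hour)
            AS
            SELECT customer_id,
                   DATE_TRUNC('hour', event_ts) AS event_hour,
                   event_type,
                   COUNT(*) AS events
            FROM events
            GROUP BY 1, 2, 3;
        """)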


I wasn’t able to achieve the same performance with Redshift aggregates; I tried that first, before I decided to migrate from Redshift to Druid back in 2014. We deal with dozens of dimensions per event, and no combination of distribution keys in Redshift was able to give us the same performance over arbitrary scans + aggregations.

Druid is not fast only because it pre-aggregates; its memory structure is also designed for scans over hundreds of dimensions.
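
For context, this is roughly the kind of ad-hoc aggregation you send to a Druid broker over its native JSON API; the datasource, host, and dimension names here are invented:

    import requests

    # Hypothetical datasource, filter, and broker address.
    query = {
        "queryType": "timeseries",
        "dataSource": "events",
        "granularity": "hour",
        "intervals": ["2020-01-01/2020-01-02"],
        "filter": {"type": "selector", "dimension": "country", "value": "US"},
        "aggregations": [{"type": "count", "name": "rows"}],
    }

    # 8082 is the default broker port; /druid/v2/ is the native query endpoint.
    resp = requests.post("http://druid-broker:8082/druid/v2/", json=query, timeout=10)
    print(resp.json())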

Materializing views in BigQuery is just one DAG task (a sketch follows below). Unless you don't have something like Airflow in your stack, I don't see why it is worth mentioning. We are talking about denormalized, time-series data.
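
To make that concrete, here is a single-task Airflow DAG (2.x style) that rebuilds a rollup table on a schedule, assuming the Google provider package and made-up table names:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryInsertJobOperator,
    )

    with DAG(
        dag_id="rebuild_events_rollup",
        schedule_interval="@hourly",
        start_date=datetime(2020, 1, 1),
        catchup=False,
    ) as dag:
        # One task: rebuild the denormalized rollup in place.
        rebuild = BigQueryInsertJobOperator(
            task_id="rebuild_rollup",
            configuration={
                "query": {
                    "query": """
                        CREATE OR REPLACE TABLE `my-project.analytics.events_rollup` AS
                        SELECT TIMESTAMP_TRUNC(event_ts, HOUR) AS hour,
                               country,
                               COUNT(*) AS events
                        FROM `my-project.analytics.events`
                        GROUP BY hour, country
                    """,
                    "useLegacySql": False,
                }
            },
        )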

I am speaking from experience with each one of these products. Perhaps I did it all wrong, but we certainly achieved the objectives we were after.


I think the limit on BigQuery is 100 concurrent queries. Also, you can use batch mode if you're willing to wait a little bit.

https://cloud.google.com/bigquery/quotas#query_jobs
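
Batch priority is one flag in the Python client; as I understand it, batch jobs queue until slots free up rather than competing with interactive queries. The table name is a placeholder:

    from google.cloud import bigquery

    client = bigquery.Client()

    # BATCH priority queues the job until capacity is available
    # instead of running it immediately at interactive priority.
    config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

    # Hypothetical table name, for illustration only.
    job = client.query(
        "SELECT COUNT(*) FROM `my-project.logs.events_all`",
        job_config=config,
    )
    for row in job.result():  # may wait a while before it starts
        print(row)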


We have negotiated a flat-rate deal with Google and have access to a concurrent query limit higher than 100, and even then it is not the right tool for my use case. BigQuery is also far more expensive.

Our users expect responses in milliseconds. The user experience is seriously degraded once you approach tens of seconds, let alone batch queries.



