Can you speak briefly to the change? What motivated it? How's it been working out?
For context, I'm an engineer who uses Postgres heavily, with some basic BigQuery + Redshift. I don't understand what the benefits of something like Clickhouse are over a standard warehouse.
Druid and Clickhouse are OLAP systems. The key here is “online”, as in they are optimized to answer queries in a matter of milliseconds. Maybe things have improved, but RedShift is not really an OLAP system. The last time I used RedShift was back in 2015, and it took tens of seconds or even minutes to answer the queries that Druid could answer in just a few seconds; the Druid cluster was a third of the cost too. AWS was charging us too much.
I love BigQuery, but I look at it more as a background processor for queries over huge datasets. With BigQuery you have some real limitations on the number of concurrent queries. Even if BQ were able to answer in seconds, your users would be hitting your concurrent-query limit all the time.
In RedShift you can perform joins, but in Druid and Clickhouse you cannot; your data has to be denormalized. There are quasi-joins, but they are not the same. If you are going to swap RedShift for Druid or Clickhouse, you need to denormalize the data or design around the specific quasi-join concepts of these databases, which abstract dictionary lookups.
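To make the denormalization point concrete, here is a minimal sketch in plain Python (with made-up table data, not any real schema): the join you would run at query time in RedShift is instead performed once at load time, producing a single wide table that you ship to Druid or Clickhouse.

```python
# Sketch: denormalizing a normalized schema before loading it into
# an OLAP store like Druid or Clickhouse. All data here is made up.

users = {  # dimension table: user_id -> attributes
    1: {"country": "US", "plan": "pro"},
    2: {"country": "DE", "plan": "free"},
}

events = [  # fact table, normalized: references users by id
    {"user_id": 1, "event": "click", "ms": 12},
    {"user_id": 2, "event": "view", "ms": 7},
    {"user_id": 1, "event": "view", "ms": 9},
]

# In RedShift you would JOIN events to users at query time.
# For Druid/Clickhouse you flatten that join once, at ingestion time:
wide_rows = [
    {**e, **users[e["user_id"]]}  # copy user attributes onto each event
    for e in events
]

# Every row now carries country/plan, so queries filter and group
# on those columns directly, with no join at query time.
print(wide_rows[0])
```

The trade-off is the usual one: the wide table is bigger and updates to a dimension mean rewriting fact rows, which is why these systems lean on quasi-joins (dictionary lookups) for slowly changing dimensions.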
Druid is operations-heavy compared to Clickhouse, while RedShift is just a few button presses. I started using Druid in 2014, and back then setting up a cluster required a decent understanding of orchestration. There are plenty of “minimal ops” solutions for Druid nowadays, but IMHO all they do is abstract the problem rather than solve it. To run a Druid cluster you need 5 or 6 node types, and arriving at the proper resource configuration for your data needs will take you a lot of time. Clickhouse requires a single node type.
Ingesting data is easy in RedShift, but you will have to work a bit more with Druid and Clickhouse. Two of the Druid node types are dedicated to ingestion, so your data pipelines should allocate them dynamically. Loading data into Clickhouse is as simple as asking it to ingest a file, but since it is big data, you need a dynamic place to stage such a file, right? All I am saying is that I wish there were an easier way to pick up the data from private S3 buckets.
Performance is a wash between Druid and Clickhouse for me. I never did a proper head-to-head between the two. However, the product I built on top of Clickhouse is a next-gen version of the product I built on top of Druid, and in my experience Clickhouse achieves the same performance at a third of the cost. However, you must account for the fact that cloud computing has gotten cheaper over the years, and my numbers for Druid come from 2017, which is the last time I approved a bill.
I forgot to mention PG. I have put small analytic datasets, 80 million rows or so, in Postgres with good results, but even then the response time was never less than a few seconds.
> I forgot to mention PG. I have put small analytic datasets, 80 million rows or so, in Postgres with good results, but even then the response time was never less than a few seconds
There is already a bewildering number of options in this space, but did you try TimescaleDB (a time-series extension for Postgres)?
I'm using it with 100 million plus rows, and aggregate queries return in less than 100ms, even on my laptop - it really is very impressive.
I haven't used them yet, but it also supports continuous aggregates (automatic roll ups), which would make common queries near instantaneous.
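The idea behind continuous aggregates can be sketched in plain Python (the hourly bucket size and field layout below are made up, not TimescaleDB's actual API): raw rows are rolled up into time buckets once, so the common query reads a handful of summary rows instead of rescanning the raw data.

```python
from collections import defaultdict

# Made-up raw time-series rows: (unix_ts, value),
# three readings per hour across three hours = 9 raw rows
raw = [(3600 * h + s, 1.0) for h in range(3) for s in (0, 600, 1200)]

BUCKET = 3600  # roll up into hourly buckets, like a 1-hour time_bucket

# Build the "continuous aggregate" once. TimescaleDB maintains this
# incrementally as data arrives; here we just compute it in one pass.
agg = defaultdict(lambda: [0, 0.0])  # bucket_start -> [count, sum]
for ts, v in raw:
    bucket = ts - ts % BUCKET
    agg[bucket][0] += 1
    agg[bucket][1] += v

# A query for "average per hour" now touches 3 summary rows, not 9 raw
# rows; at 100M+ raw rows that difference is what makes it near instant.
hourly_avg = {b: s / n for b, (n, s) in agg.items()}
print(hourly_avg)
```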
Online refers to interactive queries that return results after you submit them, in contrast to offline or batch queries. There's no hard latency requirement, and standard TPC-H data warehouse benchmark queries can take hours to run.
The main advantage of Druid is that it pre-aggregates and indexes everything, resulting in much smaller summarized data with fast seeks to specific rows. Materialized views, stored procedures, and ETL processes can provide the same result in relational data warehouses, and they come with much better SQL support, as you mentioned.
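Druid-style rollup at ingestion can be sketched in plain Python (the event stream below is made up): rows with identical dimension values collapse into one summarized row carrying a count metric, so queries scan far fewer rows.

```python
from collections import Counter

# Made-up event stream: (country, event) dimension pairs, 6 raw rows
events = [("US", "click"), ("US", "click"), ("DE", "view"),
          ("US", "click"), ("DE", "view"), ("US", "view")]

# Rollup at ingestion: identical dimension combinations merge into a
# single summarized row with an aggregated count metric.
rollup = Counter(events)

print(len(events), "raw rows ->", len(rollup), "summarized rows")
# Queries over (country, event) now scan only the small summary table;
# the reduction grows with how repetitive the dimension values are.
```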
You should try Redshift again, as it's come a long way since the early days. I also highly recommend MemSQL, which is faster than Clickhouse, has a much easier deployment and better usability, and can serve OLTP workloads in the same database.
Filtering by partial string match on dimensions. It is equally slow on both of them.
For straightforward filters and aggregations, yeah, Clickhouse is faster. However, I read a PR by the Druid team where they added a vectorized engine. If they release that, the performance gap should shrink, or maybe tip in Druid's favor depending on the dataset.
You are right about insertions up to a point.
If you consider data loads as insertions, with Druid I could scale my cluster elastically to speed them up. With Clickhouse I am bound by the query nodes.
Also, Druid can ingest data in real time using a special node type that wasn't part of the original distribution. I haven't done real-time data ingestion on Clickhouse; hourly updates are good enough for my use case.