Writing a Postgres Foreign Data Wrapper for Clickhouse in Go

tristor · on Nov 19, 2020

I wonder if you've benchmarked your Go implementation against the Percona Lab's FDW for Clickhouse? https://github.com/Percona-Lab/clickhousedb_fdw

arunk-s · on Nov 20, 2020

Hi, Author here.

I have to admit, I haven't done any benchmarking against the existing FDWs for Clickhouse. But I actually wrote the Go FDW 2 years ago (around Dec 2018). ;)

There weren't any Clickhouse FDWs available at that time; if there had been, I probably would've tried them as well.

I only just got around to writing the blog post and convincing the team to release the code.

Though I have a suspicion that the Percona FDW might win in the benchmarks, as it doesn't have to pay the penalty of crossing from Go land to C land.[1]

1: https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...

tristor · on Nov 20, 2020

Thanks for the reply. I really appreciate your post and the additional FDW. If you ever decide to give it a go, a benchmark would be interesting, if for no other reason than to see how much penalty cgo actually incurs vs possible efficiency gains in your implementation strategy.

arunk-s · on Nov 20, 2020

Yes for sure. Definitely a benchmark would be a very solid approach to see the efficiency differences.

AhtiK · on Nov 20, 2020

There's now also a more recent CH FDW written in C by the team at adjust, https://github.com/adjust/clickhouse_fdw

tristor · on Nov 20, 2020

Good find, looks like a more up to date fork of the Percona FDW.

AhtiK · on Nov 20, 2020

Yeah, the original Percona FDW hasn't been active for a while, only had a few commits after the first release and was missing quite a few features. Also doesn't support anything more recent than Postgres 11.

So the "adjust" CH FDW is the only usable choice today, thankfully quite actively maintained, primarily by the Adjust devs. I'd presume it's used for their production load.

stareatgoats · on Nov 19, 2020

I sometimes fondly remember my time as an office hero, using MSAccess to attach to just about any data source imaginable, copy to a temptable and run whatever cleaning was required before loading into the server database.

Good times; almost 25 years ago now. Sometimes I wonder if we're stuck.

daniellarusso · on Nov 20, 2020

I had actually considered doing this recently with a new hire.

I purchased them a license with Access included, but I have not had time to play with it to talk to MySQL 5.6.

stareatgoats · on Nov 20, 2020

It is embarrassingly easy, once you get over the (minimal) hurdle of setting up the ODBC connections, docs over here for MySQL 5.6: https://dev.mysql.com/doc/connector-odbc/en/

In MSAccess, just go to "External Data" -> "New Data Source" -> "From Other Sources" -> "ODBC Databases" (choose whether to link or import the data) and then create a new data source under "Machine Data Sources" using the MySQL driver.

It became like muscle memory after a while :-)

gscho · on Nov 19, 2020

This is really cool. I had not heard of Foreign Data Wrappers for Postgres before! Are these used in production commonly or more of a toy thing?

mildbyte · on Nov 19, 2020

We use them in production at Splitgraph [0] to power our DDN (like a CDN, but for data). We make a PostgreSQL-compatible endpoint available to the public to query any of the tens of thousands of open datasets by referencing them as virtual tables: they're not hosted by us but we proxy to them using Postgres FDWs. When a query comes in, we intercept it and redirect it to a FDW instance that handles query translation and planning from the PG dialect to that of the backend data source.

We wrote an FDW for Socrata-powered [1] government open data portals to query the public datasets that we index in the Splitgraph catalog as a proof-of-concept. However, there are plenty of other FDWs that we're working on integrating to let people add their own backend data sources (RDS, Snowflake etc).

FDW plugin quality varies (some of them can't push down all predicates or JOINs) but it's definitely an interesting way to think about accessing data. We also added a lot of scaffolding around foreign data wrappers in our open-source tool [2] that makes it easy to add a FDW-managed data source to a PostgreSQL instance.

[0] https://www.splitgraph.com/blog/data-delivery-network-launch

[1] https://www.tylertech.com/products/socrata

[2] https://www.splitgraph.com/blog/foreign-data-wrappers
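
To illustrate the pushdown point above (some FDWs can't push down all predicates), here is a minimal Go sketch of the idea, not Splitgraph's code or any real FDW API. `Qual`, `remoteOps`, `SplitQuals` and `RemoteQuery` are hypothetical names, and values are assumed to be already-rendered SQL literals. Conditions the backend understands go into the remote query; the rest stay behind for Postgres to filter locally.

```go
package main

import (
	"fmt"
	"strings"
)

// Qual is a hypothetical, simplified stand-in for one WHERE condition.
// Value is assumed to be an already-rendered SQL literal.
type Qual struct {
	Column, Op, Value string
}

// remoteOps lists the operators this imaginary backend can evaluate.
var remoteOps = map[string]bool{"=": true, "<": true, ">": true}

// SplitQuals separates conditions that can be pushed down to the backend
// from those that must be re-checked locally after rows come back.
func SplitQuals(quals []Qual) (remote, local []Qual) {
	for _, q := range quals {
		if remoteOps[q.Op] {
			remote = append(remote, q)
		} else {
			local = append(local, q)
		}
	}
	return remote, local
}

// RemoteQuery builds the query text sent to the backend from the
// pushed-down conditions only.
func RemoteQuery(table string, remote []Qual) string {
	sql := "SELECT * FROM " + table
	if len(remote) == 0 {
		return sql
	}
	parts := make([]string, len(remote))
	for i, q := range remote {
		parts[i] = fmt.Sprintf("%s %s %s", q.Column, q.Op, q.Value)
	}
	return sql + " WHERE " + strings.Join(parts, " AND ")
}

func main() {
	quals := []Qual{{"id", "=", "42"}, {"name", "~~", "'a%'"}}
	remote, local := SplitQuals(quals)
	fmt.Println(RemoteQuery("events", remote))
	fmt.Println("filtered locally:", len(local))
}
```

The fewer conditions end up in the local bucket, the less data has to cross the wire, which is why pushdown support is a big part of FDW quality.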

sixdimensional · on Nov 19, 2020

Then you might be even more surprised to find out that Foreign Data Wrappers are Postgres' implementation of the ANSI SQL standard extension known as "SQL/MED" [1], where MED stands for "management of external data".

I spent a number of years working on a data federation/virtualization engine - and SQL/MED is very much related to that.

It's actually a relatively unknown topic among many software/data engineers I have worked with, but things like GraphQL federation (for example, Apollo GraphQL) or some of the more popular tools (such as Presto, Dremio, Denodo, etc.) are where more advanced versions of this are today.

SQL/MED and what Postgres can do is quite cool, but just know that any time you cross system boundaries (e.g. between heterogeneous systems), things like joins, data types, and many other things become a bit more difficult - or you just have to think about them more. But very cool tech.

I've used SQL/MED in Postgres, FDW, linked servers in SQL Server, database links in Oracle, and more advanced virtualization/federation engines also.

If you haven't been exposed to this area before, highly recommended as another tool to know about for the toolkit.

[1] https://en.wikipedia.org/wiki/SQL/MED

kid_atticus · on Nov 19, 2020

There are generally two classes for FDWs: Postgres<->Postgres and Postgres->Everything else.

The first one is generally suitable for production and is very useful for sharded Postgres when you want to communicate across shards without having to go back out through the application.

The second one's mileage really varies. Some implementations might or might not be prod ready, or might target only specific version combinations. They can be very useful for data engineering or analytics use cases, for quick ETL into a staging database, or for data migrations between database vendors.
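
For the Postgres<->Postgres case, a rough sketch of what wiring up one shard with postgres_fdw can look like, written as a Go helper that only generates the SQL text. The `Shard` type, `SetupStatements` and the example values are hypothetical, and `Name` is assumed to be a plain identifier that needs no quoting; the statements themselves are the standard `CREATE SERVER`, `CREATE USER MAPPING` and `IMPORT FOREIGN SCHEMA` commands.

```go
package main

import (
	"fmt"
	"strings"
)

// Shard holds hypothetical connection details for one remote Postgres.
type Shard struct {
	Name, Host, Port, DBName, User, Password string
}

// quote renders s as a single-quoted SQL string literal.
func quote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// SetupStatements returns the SQL needed to expose the remote shard's
// public schema as foreign tables in a local schema named after the shard.
func SetupStatements(s Shard) []string {
	return []string{
		"CREATE EXTENSION IF NOT EXISTS postgres_fdw",
		fmt.Sprintf("CREATE SERVER %s FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host %s, port %s, dbname %s)",
			s.Name, quote(s.Host), quote(s.Port), quote(s.DBName)),
		fmt.Sprintf("CREATE USER MAPPING FOR CURRENT_USER SERVER %s OPTIONS (user %s, password %s)",
			s.Name, quote(s.User), quote(s.Password)),
		fmt.Sprintf("CREATE SCHEMA IF NOT EXISTS %s", s.Name),
		fmt.Sprintf("IMPORT FOREIGN SCHEMA public FROM SERVER %s INTO %s", s.Name, s.Name),
	}
}

func main() {
	shard := Shard{Name: "shard2", Host: "10.0.0.2", Port: "5432", DBName: "app", User: "app", Password: "secret"}
	for _, stmt := range SetupStatements(shard) {
		fmt.Println(stmt + ";")
	}
}
```

After running these on the local node, a query like `SELECT ... FROM shard2.some_table` is forwarded to the remote shard without going back through the application.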

rch · on Nov 19, 2020

This could be of interest to you:

Ville Tuulos - How to Build a SQL-based Data Warehouse for 100+ Billion Rows in Python

PyData SV 2014 - In this talk, we show how and why AdRoll built a custom, high-performance data warehouse in Python which can handle hundreds of billions of data points with sub-minute latency on a small cluster of servers. This feat is made possible by a non-trivial combination of compressed data structures, meta-programming, and just-in-time compilation using Numba, a compiler for numerical Python. To enable smooth interoperability with existing tools, the system provides a standard SQL-interface using Multicorn and Foreign Data Wrappers in PostgreSQL.

https://www.youtube.com/watch?v=xnfnv6WT1Ng

arunk-s · on Nov 19, 2020

I definitely think they are used in production though I haven't tried it myself.

But I can find anecdotes on the web of people using them in production[1].

1: https://carto.com/blog/postgres-fdw/

noefingway · on Nov 19, 2020

Currently using this one https://github.com/pramsey/pgsql-ogr-fdw for an image migration project. Using it to pull image metadata through ODBC. While its primary design is for GIS data, it works well for any ODBC database.

caseyohara · on Nov 19, 2020

Here's a great list of Postgres FDWs: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

I had no idea PG has native FDWs for Twitter and S3. That's pretty awesome.

AhtiK · on Nov 20, 2020

It's been used quite a lot, especially when dealing with legacy databases or having to support multiple databases with a single codebase.

There is one concern though. Both Google Cloud SQL and AWS RDS support only postgres_fdw, so to use any other FDW one has to self-host Postgres and manage their own storage, cluster and backups.

fork1 · on Nov 20, 2020

I've written a DB2 FDW that is used in production. Not as good as direct access of course, but very practical still.

_ph_ · on Nov 19, 2020

I have done something similar. For a piece of software that takes shared libraries - usually written in C - as plugins, I wrote such a plugin in Go. This was a very good experience. It took only a little work to set up the C-compatible Go functions realizing the API, but the rest I could implement in Go, which made life so much nicer. I also ended up calling back into some of the application's APIs from Go, and that worked seamlessly.

Go has good facilities interfacing with C, the only attention you need to pay is properly handling C pointers (manual memory management) vs. Go pointers (automatically managed via the GC). But with very little care this is not a big issue. The Go part of the code is however much nicer than if you had to implement the functionality in C. (Yes, I do think that Go is a great C replacement)

gscho · on Nov 19, 2020

I only just discovered foreign data wrappers today, but it seems like you can write these in Python too, right? Might be easier than using Go to interface with C.

tankenmate · on Nov 20, 2020

Messagebird uses Go on the backend; I suspect that's why they wrote the FDW in Go.

arunk-s · on Nov 20, 2020

Hi, Author here!

Yes, your hypothesis is correct. We chose Go, mostly because almost all of us were quite familiar with it and there weren't many C experts :)

daniellarusso · on Nov 20, 2020

So, naive question, and I know the concept of a ‘database’ is a bit different between MySQL and Postgres, does Postgres require a FDW to communicate between multiple databases on the same server?

mewwts · on Nov 20, 2020