
FWIW, this should be standard practice anyway.

Your dev environment should be running on a sanitised version of prod data.

Some people prefer to operate on a subset (~10%) of prod data; I have always preferred to pay the cost of taking everything.

This has the nice consequence of letting you test automated restores at regular intervals, but you must be mindful of the "fail-open" nature of sanitising data: any new column or field must be known to the sanitisation tool, or it will pass through unsanitised.
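A minimal sketch of guarding against that, assuming a Postgres source and psycopg2 (the table/column names and DSN are illustrative): compare the live schema against an explicit allowlist and abort if anything unknown shows up.

    import psycopg2

    # Every (table, column) must be explicitly listed: either kept as-is or
    # mapped to a masking rule. Anything not listed aborts the pipeline.
    SANITISE_RULES = {
        ("users", "id"): "keep",
        ("users", "email"): "mask_email",
        ("users", "full_name"): "fake_name",
    }

    def check_schema_is_covered(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT table_name, column_name "
                "FROM information_schema.columns "
                "WHERE table_schema = 'public'"
            )
            unknown = [(t, c) for t, c in cur.fetchall()
                       if (t, c) not in SANITISE_RULES]
        if unknown:
            # Fail closed: refuse to ship anything rather than leak new columns.
            raise RuntimeError(f"Columns with no sanitisation rule: {unknown}")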

Pipeline usually goes:

Prod/live -> Replica -> Backup/Snapshot -> Copy to archive -> Copy from archive via streamed sanitiser -> sanitised backup -> Restore[n] Dev Envs from same backup.
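As a rough sketch of the "copy from archive via streamed sanitiser" hop, assuming a plain-SQL dump (paths and the rewrite rule are placeholders; real rules would be config-driven per table/column):

    import gzip

    ARCHIVE_DUMP = "/archive/prod-latest.sql.gz"         # copy pulled from archive
    SANITISED_DUMP = "/backups/sanitised-latest.sql.gz"  # what every dev env restores

    def sanitise_line(line: str) -> str:
        # Placeholder rule; real ones come from the sanitisation config.
        return line.replace("@customer-domain.example", "@example.com")

    # Stream the dump through the sanitiser so the raw data is never
    # materialised on the dev side.
    with gzip.open(ARCHIVE_DUMP, "rt") as src, gzip.open(SANITISED_DUMP, "wt") as dst:
        for line in src:
            dst.write(sanitise_line(line))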




Taking everything? I'll call you in 2 weeks when the indexes are done building


I am finding it difficult not to reply with snark, because I'm quite sure that two weeks of downtime to restore your systems in a data-corruption or complete-failover scenario is not going to be acceptable to your directors.

But even that said: you can copy the binary files over to a new machine (copy-on-write snapshot -> rsync) -> store a copy -> start up the database, sanitise -> ship it around to dev envs.
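Something along these lines, as a sketch (the ZFS dataset, host names, and file paths are made up; LVM snapshots work just as well):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Copy-on-write snapshot of the live data directory.
    run(["zfs", "snapshot", "tank/pgdata@sanitise"])

    # 2. rsync the snapshot's files to a fresh machine (and keep a copy).
    run(["rsync", "-a", "/tank/pgdata/.zfs/snapshot/sanitise/",
         "sanitise-host:/var/lib/postgresql/data/"])

    # 3. Start the database there and run the sanitisation SQL against it.
    run(["ssh", "sanitise-host", "pg_ctl -D /var/lib/postgresql/data start"])
    run(["ssh", "sanitise-host", "psql -d prod_copy -f /opt/sanitise.sql"])

    # 4. Dump the sanitised copy and ship that around to the dev envs.
    run(["ssh", "sanitise-host", "pg_dump -Fc -f /tmp/sanitised.dump prod_copy"])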


What happens when prod is a few hundred terabytes, or you use logical replication to stream changes to handle major version updates? The GP's point was that shipping 100% of a large database isn't feasible.

You’re conflating dev environments with restoring backups. Those can be the same thing but are often separate.


If your prod environment is hundreds of terabytes then making good dev environments is even more crucial and you can’t run things locally.

If you’re running hundreds of terabytes then the systems in place to shard that data must be well tested.

Migrations must happen on similarly sized data, along with the various distributed-transaction guarantees, because I doubt you're going to be using dedicated attached storage for that. And if you do, then multipath needs to be part of your testing too.

Is it expensive? Yes. But that’s what working with that amount of data costs.

Or is this a strawman intended to stump me? I have dealt with such “data requirements” before, and when they saw the sticker price of doing things properly, suddenly those hundreds of terabytes weren’t as “required” anymore.


You can also do zero-copy clones of production in Aurora, Snowflake, etc., so you don't have to duplicate the whole thing.
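For example, a rough boto3 sketch of an Aurora clone (the cluster identifiers are placeholders; the Snowflake equivalent is a one-liner, CREATE DATABASE dev CLONE prod):

    import boto3

    rds = boto3.client("rds")

    # A "copy-on-write" restore shares storage pages with the source cluster
    # until either side changes them, so the clone is cheap and fast.
    rds.restore_db_cluster_to_point_in_time(
        DBClusterIdentifier="dev-clone",
        SourceDBClusterIdentifier="prod-cluster",
        RestoreType="copy-on-write",
        UseLatestRestorableTime=True,
    )
    # You still need to add an instance to the new cluster before it can serve
    # queries (rds.create_db_instance with DBClusterIdentifier="dev-clone").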


Do you have recommendations for guides and tools to automatically get a sanitized subset of prod data of Postgres for development?

I haven't looked into it in a while and the last time we ended up rolling our own.


I don't think there are ready-made tools; it has been custom in every environment I've worked in.

I believe any tool for doing this would have to be so generalised that it would be extremely difficult to configure (perhaps as difficult as just setting it up with custom shell scripts).
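The custom versions I've seen mostly boil down to a config of per-column masking expressions applied to the restored copy before it is handed to devs. A minimal sketch, with the table, columns, and DSN purely illustrative:

    import psycopg2

    # Per-table masking expressions; the config effectively is the tool.
    MASKS = {
        "users": {
            "email": "concat('user', id, '@example.com')",
            "full_name": "'Redacted User'",
            "phone": "NULL",
        },
    }

    def apply_masks(dsn: str) -> None:
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        with conn.cursor() as cur:
            for table, columns in MASKS.items():
                assignments = ", ".join(f"{col} = {expr}" for col, expr in columns.items())
                cur.execute(f"UPDATE {table} SET {assignments}")
        conn.close()

    apply_masks("dbname=prod_copy user=sanitiser")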


You could have a look at dblab [1], which (AFAIK — I've not yet tried using them) has some support for streaming in from a primary source and applying sanitisation functions/transforms.

The main value is the use of ZFS snapshots to give you an almost-instant (2-3 s for a 20 GB DB on my dev laptop) writeable clone of an import, which you can test your migrations etc. against and then just revert or destroy; that has been extremely helpful for me.
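Not dblab's actual interface, but the underlying ZFS mechanic it builds on looks roughly like this (dataset names and port are made up):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    run(["zfs", "snapshot", "tank/pgdata@imported"])              # snapshot the import
    run(["zfs", "clone", "tank/pgdata@imported", "tank/clone1"])  # writeable clone, ~seconds
    run(["pg_ctl", "-D", "/tank/clone1", "-o", "-p 5440", "start"])
    # ... run migrations against port 5440 and poke at the result ...
    run(["pg_ctl", "-D", "/tank/clone1", "stop"])
    run(["zfs", "destroy", "tank/clone1"])                        # "revert" is just this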

Happy user, no relationship, etc.

[1] https://postgres.ai/products/how-it-works



