Your dev environment should be running on a sanitised version of prod data.
Some people prefer to operate on a subset (~10%) of prod data; I have always preferred to pay the cost of taking everything.
This has the nice consequence of letting you test automated restores at regular intervals, but you must be mindful of the "fail-open" nature of sanitising data (i.e. new fields must be known to the sanitisation tool, or they pass through unsanitised).
Pipeline usually goes:
Prod/live -> Replica -> Backup/Snapshot -> Copy to archive -> Copy from archive via streamed sanitiser -> sanitised backup -> Restore[n] Dev Envs from same backup.
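The sanitiser step is where the "fail-open" risk lives, so the one property worth insisting on is that it refuses to run when it sees a field it doesn't recognise. A minimal sketch of that idea in Python, assuming the export step emits CSV with a header row; the column names and transforms are made up:

    # sanitise_stream.py -- illustrative fail-closed streaming sanitiser.
    import csv
    import hashlib
    import sys

    # Every column must be listed here; anything unknown aborts the run,
    # so a newly added prod column can never leak through unsanitised.
    POLICY = {
        "id":         lambda v: v,            # safe to keep as-is
        "created_at": lambda v: v,            # safe to keep as-is
        "full_name":  lambda v: "REDACTED",
        "email":      lambda v: hashlib.sha256(v.encode()).hexdigest() + "@example.invalid",
    }

    def main() -> int:
        reader = csv.DictReader(sys.stdin)
        unknown = set(reader.fieldnames or []) - POLICY.keys()
        if unknown:
            # Fail closed: produce no output rather than pass unknown data through.
            print(f"refusing to run, unsanitised columns: {sorted(unknown)}", file=sys.stderr)
            return 1
        writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: POLICY[col](val) for col, val in row.items()})
        return 0

    if __name__ == "__main__":
        sys.exit(main())

In the pipeline above it would sit between "copy from archive" and "sanitised backup", fed from whatever per-table export format you stream out of the archive copy.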
I am finding it difficult not to reply with snark, because I'm quite sure that two weeks of downtime to restore your systems in a data-corruption or complete-failover scenario is not going to seem reasonable to your directors.
But, even that said: you can copy the binary files over to a new machine (copy-on-write snapshot -> rsync) -> store a copy -> start up the database -> sanitise -> ship around to the dev envs.
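As a rough sketch of that hop, assuming the data directory sits on a ZFS dataset mounted at its default path; the dataset, snapshot and host names are all made up:

    # handoff_snapshot.py -- illustrative "CoW snapshot -> rsync" hop.
    import subprocess
    from datetime import datetime, timezone

    DATASET = "tank/pgdata"                      # hypothetical ZFS dataset
    STAGING = "staging-db:/var/lib/postgresql/"  # hypothetical rsync target

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def main() -> None:
        snap = "handoff-" + datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        # Crash-consistent, copy-on-write snapshot of the live data dir.
        run(["zfs", "snapshot", f"{DATASET}@{snap}"])
        # The snapshot is readable under .zfs/snapshot/<name>/, so rsync can
        # copy stable binary files while the primary keeps running.
        run(["rsync", "-a", "--delete", f"/{DATASET}/.zfs/snapshot/{snap}/", STAGING])
        run(["zfs", "destroy", f"{DATASET}@{snap}"])
        # On the staging host: start the database from the copied files (it will
        # do crash recovery), run the sanitiser, back up, ship to the dev envs.

    if __name__ == "__main__":
        main()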
What happens when prod is a few hundred terabytes, or you use logical replication to stream changes to handle major version updates? The GP’s point was that shipping 100% of a large database isn’t feasible.
You’re conflating dev environments with restoring backups. Those can be the same thing but are often separate.
If your prod environment is hundreds of terabytes then making good dev environments is even more crucial and you can’t run things locally.
If you’re running hundreds of terabytes then the systems in place to shard that data must be well tested.
Migrations must happen on similarly sized data, along with testing the various distributed-transaction guarantees, because I doubt you’re going to be using direct-attached storage for that. And if you are, then multipath needs to be part of your testing too.
Is it expensive? Yes. But that’s what working with that amount of data costs.
Or is this a strawman intended to stump me? I have dealt with such “data requirements” before, and when they saw the sticker price of doing things properly, suddenly those hundreds of terabytes weren’t as “required” anymore.
I don't think there are ready-made tools; it's been custom in every environment I’ve worked in.
Any tool for doing this would, I believe, have to be so generalised that it becomes extremely difficult to configure (maybe as difficult as setting it up with custom shell scripts).
You could have a look at dblab[1], which (AFAIK, I've not yet tried it) has some support for streaming in from a primary source and applying sanitisation functions/transforms.
The main value is its use of ZFS snapshots to give you an almost-instant (2-3s for a 20G DB on my dev laptop) writeable clone of an import, which you can test your migrations etc. against and then just revert or destroy. That has been extremely helpful for me.
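For anyone curious what that looks like under the hood, here is a rough sketch of the snapshot/clone/destroy cycle with plain zfs commands, wrapped in Python. This is not dblab's API, just the underlying ZFS mechanism, and the dataset names are made up:

    # zfs_trial_clone.py -- throwaway writeable clone for testing a migration.
    import subprocess
    from contextlib import contextmanager

    def zfs(*args: str) -> str:
        out = subprocess.run(["zfs", *args], check=True, capture_output=True, text=True)
        return out.stdout.strip()

    @contextmanager
    def trial_clone(snapshot: str, clone: str):
        """Create a writeable clone of `snapshot`, yield its mountpoint,
        and destroy the clone afterwards regardless of the test outcome."""
        zfs("clone", snapshot, clone)
        try:
            yield zfs("get", "-H", "-o", "value", "mountpoint", clone)
        finally:
            zfs("destroy", clone)

    if __name__ == "__main__":
        # Hypothetical names: point a throwaway postgres at the clone's data
        # dir, run the migration, inspect, then let it all be destroyed.
        with trial_clone("tank/devdb@imported", "tank/devdb-migration-test") as mnt:
            print("run migrations against data dir:", mnt)

Creating and destroying the clone is near-instant because ZFS only tracks the blocks the trial actually changes.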