
FWIW, this should be standard practice anyway.

Your dev environment should be running on a sanitised version of prod data.

Some people prefer to operate on a subset (~10%) of prod data; I have always preferred to pay the cost of taking everything.

This has the nice consequence of letting you test automated restores at regular intervals, but you must be mindful of the "fail-open" nature of sanitising data: any new column or field must be known to the sanitisation tool, or it will pass through unsanitised.
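A minimal sketch of guarding against that, assuming a Postgres source and psycopg2 (the table/column names and DSN are illustrative): compare the live schema against an explicit allowlist and abort if anything unknown shows up.

    import psycopg2

    # Every (table, column) must be explicitly listed: either kept as-is or
    # mapped to a masking rule. Anything not listed aborts the pipeline.
    SANITISE_RULES = {
        ("users", "id"): "keep",
        ("users", "email"): "mask_email",
        ("users", "full_name"): "fake_name",
    }

    def check_schema_is_covered(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT table_name, column_name "
                "FROM information_schema.columns "
                "WHERE table_schema = 'public'"
            )
            unknown = [(t, c) for t, c in cur.fetchall()
                       if (t, c) not in SANITISE_RULES]
        if unknown:
            # Fail closed: refuse to ship anything rather than leak new columns.
            raise RuntimeError(f"Columns with no sanitisation rule: {unknown}")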

Pipeline usually goes:

Prod/live -> Replica -> Backup/Snapshot -> Copy to archive -> Copy from archive via streamed sanitiser -> sanitised backup -> Restore[n] Dev Envs from same backup.
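As a rough sketch of the "copy from archive via streamed sanitiser" hop, assuming a plain-SQL dump (paths and the rewrite rule are placeholders; real rules would be config-driven per table/column):

    import gzip

    ARCHIVE_DUMP = "/archive/prod-latest.sql.gz"         # copy pulled from archive
    SANITISED_DUMP = "/backups/sanitised-latest.sql.gz"  # what every dev env restores

    def sanitise_line(line: str) -> str:
        # Placeholder rule; real ones come from the sanitisation config.
        return line.replace("@customer-domain.example", "@example.com")

    # Stream the dump through the sanitiser so the raw data is never
    # materialised on the dev side.
    with gzip.open(ARCHIVE_DUMP, "rt") as src, gzip.open(SANITISED_DUMP, "wt") as dst:
        for line in src:
            dst.write(sanitise_line(line))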




Taking everything? I'll call you in 2 weeks when the indexes are done building


I am finding it difficult not to reply with snark, because I'm quite sure that two weeks of downtime to restore your systems in a data-corruption or complete-failover scenario is not going to be acceptable to your directors.

But even that said: you can copy the binary files over to a new machine (copy-on-write snapshot -> rsync) -> store a copy -> start up the database, sanitise -> ship it around to dev envs.
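Something along these lines, as a sketch (the ZFS dataset, host names, and file paths are made up; LVM snapshots work just as well):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Copy-on-write snapshot of the live data directory.
    run(["zfs", "snapshot", "tank/pgdata@sanitise"])

    # 2. rsync the snapshot's files to a fresh machine (and keep a copy).
    run(["rsync", "-a", "/tank/pgdata/.zfs/snapshot/sanitise/",
         "sanitise-host:/var/lib/postgresql/data/"])

    # 3. Start the database there and run the sanitisation SQL against it.
    run(["ssh", "sanitise-host", "pg_ctl -D /var/lib/postgresql/data start"])
    run(["ssh", "sanitise-host", "psql -d prod_copy -f /opt/sanitise.sql"])

    # 4. Dump the sanitised copy and ship that around to the dev envs.
    run(["ssh", "sanitise-host", "pg_dump -Fc -f /tmp/sanitised.dump prod_copy"])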


What happens when prod is a few hundred terabytes, or you use logical replication to stream changes to handle major version updates? The GP's point was that shipping 100% of a large database isn't feasible.

You’re conflating dev environments with restoring backups. Those can be the same thing but are often separate.


If your prod environment is hundreds of terabytes then making good dev environments is even more crucial and you can’t run things locally.

If you’re running hundreds of terabytes then the systems in place to shard that data must be well tested.

Migrations must happen on similarly sized data, along with the various distributed-transaction guarantees, because I doubt you're going to be using dedicated attached storage for that. And if you do, then multipath needs to be part of your testing too.

Is it expensive? Yes. But that’s what working with that amount of data costs.

Or is this a strawman intended to stump me? I have dealt with such “data requirements” before, and when they saw the sticker price of doing things properly, suddenly those hundreds of terabytes weren’t as “required” anymore.


You can also do zero-copy clones of production in Aurora, Snowflake, etc., so you don't have to duplicate the whole thing.
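For example, a rough boto3 sketch of an Aurora clone (the cluster identifiers are placeholders; the Snowflake equivalent is a one-liner, CREATE DATABASE dev CLONE prod):

    import boto3

    rds = boto3.client("rds")

    # A "copy-on-write" restore shares storage pages with the source cluster
    # until either side changes them, so the clone is cheap and fast.
    rds.restore_db_cluster_to_point_in_time(
        DBClusterIdentifier="dev-clone",
        SourceDBClusterIdentifier="prod-cluster",
        RestoreType="copy-on-write",
        UseLatestRestorableTime=True,
    )
    # You still need to add an instance to the new cluster before it can serve
    # queries (rds.create_db_instance with DBClusterIdentifier="dev-clone").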


Do you have recommendations for guides and tools to automatically get a sanitized subset of prod data of Postgres for development?

I haven't looked into it in a while and the last time we ended up rolling our own.


I don't think there are ready-made tools; it has been custom in every environment I've worked in.

I believe any tool for doing this would have to be so generalised that it would be extremely difficult to configure (perhaps as difficult as just setting it up with custom shell scripts).
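The custom versions I've seen mostly boil down to a config of per-column masking expressions applied to the restored copy before it is handed to devs. A minimal sketch, with the table, columns, and DSN purely illustrative:

    import psycopg2

    # Per-table masking expressions; the config effectively is the tool.
    MASKS = {
        "users": {
            "email": "concat('user', id, '@example.com')",
            "full_name": "'Redacted User'",
            "phone": "NULL",
        },
    }

    def apply_masks(dsn: str) -> None:
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        with conn.cursor() as cur:
            for table, columns in MASKS.items():
                assignments = ", ".join(f"{col} = {expr}" for col, expr in columns.items())
                cur.execute(f"UPDATE {table} SET {assignments}")
        conn.close()

    apply_masks("dbname=prod_copy user=sanitiser")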


You could have a look at dblab [1], which (AFAIK — I've not yet tried using them) has some support for streaming in from a primary source and applying sanitisation functions/transforms.

The main value is the use of ZFS snapshots to give you an almost-instant (2-3 s for a 20 GB DB on my dev laptop) writeable clone of an import, which you can test your migrations etc. against and then just revert or destroy; that has been extremely helpful for me.
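Not dblab's actual interface, but the underlying ZFS mechanic it builds on looks roughly like this (dataset names and port are made up):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    run(["zfs", "snapshot", "tank/pgdata@imported"])              # snapshot the import
    run(["zfs", "clone", "tank/pgdata@imported", "tank/clone1"])  # writeable clone, ~seconds
    run(["pg_ctl", "-D", "/tank/clone1", "-o", "-p 5440", "start"])
    # ... run migrations against port 5440 and poke at the result ...
    run(["pg_ctl", "-D", "/tank/clone1", "stop"])
    run(["zfs", "destroy", "tank/clone1"])                        # "revert" is just this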

Happy user, no relationship, etc.

[1] https://postgres.ai/products/how-it-works



