Ask HN: How do you manage direct updates to databases in a production system
96 points by dmundhra on Dec 15, 2021 | 78 comments
Hi HN! When you have a production system running, every now and then there are requirements to make a DB update because of some bug in the code, or for some use cases which were not thought of earlier. How do you handle such updates?

The common way I know is that access is given to a small number of trusted people who make the updates. Are there any other ways or best practices, so that this small set of people doesn't end up becoming a bottleneck, apart from the potential risk of abuse of trust?




Through the years I have worked on this problem many times. I've worked with rake tasks when using Ruby, and many times in Java, with Flyway and my own system. Database migration is not a complicated problem: you just commit to the repository a sequence of SQL/code scripts to execute and keep the current step in the database.

The big misconception people have is that they can "roll back" when an issue arises. In reality, when there is an issue it is because something unplanned happened -- spurious data in production, production environment differences, script errors, or whatever -- and trying to roll back automatically from an undetermined state is just crazy. That's why migrations must always be FORWARD, no matter what. You mitigate risks as best you can by running the migration in test databases (the perfect scenario is a test on a clone of production data, but this is not usually possible), and if anything goes wrong, deal with it: fix the database by hand, make code patches, and after it is solved, try to figure out why this slipped through your QA.
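A minimal sketch of the forward-only bookkeeping I mean (Postgres-flavoured, all names illustrative; Flyway and friends do the same thing with more polish):

    -- One row per migration that has been applied; the tool reads the max
    -- version at startup and runs every later script, in order.
    CREATE TABLE IF NOT EXISTS schema_version (
        version     INT PRIMARY KEY,
        description TEXT NOT NULL,
        applied_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- 0003_add_bar_to_foo.sql  -- always a new file, never edit old ones
    BEGIN;
    ALTER TABLE foo ADD COLUMN bar INT;
    INSERT INTO schema_version (version, description)
        VALUES (3, 'add bar to foo');
    COMMIT;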


The difference is largely semantic in the sense that you're still going "forwards" with a down migration, but the important bit is that by calling it a rollback you're accepting that the up migration won't be run again if you ever completely tear down the database and start again. Suggesting that you only ever go forwards implies that you'd run the incorrect migration and the fix for it again in the future, but you wouldn't. Calling it a rollback captures that information.


It's not a semantic difference. If you roll back, that's not in the history. That's fine if the rollback perfectly reversed your original migration. But if it didn't (developers make mistakes), now your database has reached a state that can't be reproduced by running the migrations, defeating the entire point.

> Suggesting that you only ever go forwards implies that you'd run the incorrect migration and the fix for it again in the future, but you wouldn't.

I wrote a tool that specifically enforces this for my company, and it's worked well for the past four years in prod, staging and on a dozen dev environments, so I don't think your assertion is true.


> The big misconception people have is that they can "rollback" when an issue arises

I have reasonably working rollbacks; if things go sideways it takes only a few seconds to undo the latest changes.


I guess this is exactly that misconception. The problem happens when you encounter a scenario where your rollback doesn't work, i.e.:

> when there is an issue, it happened because something not planned happened


In 12+ years it worked every time.

I understand it didn't work for you, but it does work for me.


I'm curious how you implemented this system.


In my SQL Server database I have a VERSION table with one row and one column that contains the current "version" of the database, and a DowngradeSteps table that keeps track of what needs to be done to undo the changes.

In my project source control I have a SQL script SyncSchema.sql. It contains a series of step-pairs (roll-forward + roll-back):

  IF (SELECT V FROM VERSION) < XXX
  BEGIN
      BEGIN TRAN
          UPDATE VERSION SET V = V + 1

          -- Roll-forward step:
          ALTER TABLE FOO ADD BAR INT

          -- Record the matching roll-back step:
          INSERT INTO DowngradeSteps VALUES (
              (SELECT V FROM VERSION),
              'ALTER TABLE FOO DROP COLUMN BAR'
          )
      COMMIT TRAN
  END
where XXX starts from zero and goes up every time I need to add a change. The last batch in the file has the highest XXX the script knows about, which is therefore the TARGET db version. This script runs upon each deployment; if you ignore the DowngradeSteps/roll-back part, it is the usual roll-forward scheme.

Now the roll-back part: once SyncSchema.sql runs through all steps up to the TARGET version, it opens the DowngradeSteps table and executes all roll-back steps contained therein that exceed the TARGET. This way I can sync my source control to any point in the past and immediately deploy -- the C# code is updated to the desired version and the SQL database is updated to the TARGET version, so I enjoy a consistent picture. It takes literally seconds to do.
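The roll-back pass looks roughly like this (the DowngradeSteps column names here are illustrative, not the exact ones I use; the real script is a bit longer):

    DECLARE @TARGET int = XXX   -- highest step known to this copy of SyncSchema.sql
    DECLARE @step int, @sql nvarchar(max)

    WHILE EXISTS (SELECT 1 FROM DowngradeSteps WHERE Step > @TARGET)
    BEGIN
        SELECT TOP 1 @step = Step, @sql = RollbackSql
        FROM DowngradeSteps
        WHERE Step > @TARGET
        ORDER BY Step DESC            -- undo the newest change first

        EXEC (@sql)
        DELETE FROM DowngradeSteps WHERE Step = @step
        UPDATE VERSION SET V = @step - 1
    END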

The roll-forward and roll-back changes are always coded in pairs and are tested and code-reviewed before being committed.

I put this scheme in place 12 years ago and it's been serving me well. *

* One caveat is that SQL Server sometimes throws a fit due to lame name-binding rules (it binds names inside the IF block even when the condition is false), so you have to enclose some of the steps in exec(), and you have to do this proactively(!). This sucks, but maybe your database doesn't do that.


The schema (or data in a database) in your release N should be compatible with release N-1, so you can revert your code to the previous working version with the new schema. That's the only way to ship new releases without stopping the service for maintenance.
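A sketch of what that compatibility usually means in practice (expand first, contract later; table/column names invented):

    -- Shipped with release N: purely additive, so release N-1 code still works
    -- because it simply ignores the new nullable column.
    ALTER TABLE orders ADD COLUMN shipping_notes TEXT NULL;

    -- Shipped with release N+1, once no running code depends on the old shape:
    ALTER TABLE orders ALTER COLUMN shipping_notes SET NOT NULL;  -- Postgres syntax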


A lot of replies talk about schema migration and tools like Liquibase etc. But as I understand the question, it's about running update/delete queries in a production environment, not DDL.

In my opinion such queries should be subject to an established development process:
1. Commit your query to a repository
2. Send it to code review
3. After successful review and merge, it should be deployed automatically by CI/CD

It may be necessary to run a query directly via console in some cases, though. But such a query should be committed to the repository later anyway.
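For example, the committed artifact can be a small, self-describing script rather than a console one-liner (file name and table are hypothetical):

    -- fixes/2021-12-15_orders_stuck_in_pending.sql
    -- Reviewed in the MR; expected blast radius: a few hundred rows.
    BEGIN;

    UPDATE orders
       SET status = 'cancelled'
     WHERE status = 'pending'
       AND created_at < now() - interval '90 days';

    COMMIT;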

And of course you should use proper tools, as other comments suggest.


Flyway and Liquibase do update and delete queries fine too, albeit that's not their main intended use case.


Correct. We use Flyway and have it set up through GitLab CI/CD pipelines. We used to have a bottleneck on one of two people with permissions. Now, if data needs to be pushed to the production system for some reason through a script, an Issue/MR in GitLab gets the Flyway script, it must be approved by two people on the team (other than the person who did the work), it must make it through the test system, and then it can be pushed to production directly through GitLab. It adds process but is safer and reduces the bus factor.
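As a hypothetical example, the MR contains nothing more than a new versioned script, and the pipeline applies it with "flyway migrate" on test and, after approvals, on prod:

    -- V2021.12.15.1__backfill_invoice_status.sql
    -- Flyway records the file name and checksum in its history table,
    -- so the script can never silently run twice or be edited after the fact.
    UPDATE invoices
       SET status = 'PAID'
     WHERE status IS NULL
       AND paid_at IS NOT NULL;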


This is the way.


Use a tool like https://www.liquibase.org/ or https://flywaydb.org/ to manage updates to databases. You create a migration for your change, check it into git and run this migration on every environment you want to. Liquibase, for example, creates a history table that tracks which migrations have already been applied.
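For example, a changeset in Liquibase's formatted-SQL flavour looks roughly like this (author/id are whatever your team uses); applied changesets end up in its DATABASECHANGELOG history table:

    --liquibase formatted sql

    --changeset alice:add-index-users-email
    CREATE INDEX idx_users_email ON users (email);
    --rollback DROP INDEX idx_users_email;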


If you're doing an update/delete/insert in prod because you really have to (I had this a couple of times in the last few months):

- reproduce the issue and fix locally, with prod data copied over if needed (better than guessing)

- save (back up) all data that is being changed somewhere, even if it's on your local machine temporarily (again, better than not having it)

- if your datastore has transactions, run the change and a confirmation query in a transaction/rollback several times before committing (a sketch follows at the end of this comment)

- don't hide what you're doing, tell everyone who will listen that you're mucking about in prod

- if someone will hop on a screen share with you, let them

Main theme here, when making data changes, is test as much as you can and make sure folks know it's happening. If you're making data changes, the code has already failed, so don't feel bad doing what you need to do, but mitigate loss as much as you can. Shit happens, deal with it, but don't make it worse if you can.
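A sketch of the backup + transaction points above (Postgres-flavoured, all names made up):

    -- Keep a copy of exactly the rows you are about to touch.
    CREATE TABLE fix_20211215_orders_backup AS
        SELECT * FROM orders WHERE status = 'stuck';

    BEGIN;
    UPDATE orders SET status = 'cancelled' WHERE status = 'stuck';
    -- Confirmation query: do the counts and a sample of rows look right?
    SELECT status, count(*) FROM orders GROUP BY status;
    ROLLBACK;  -- rehearse with ROLLBACK; switch to COMMIT only once you're sure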


You (not you specifically; anyone reading this) need to be careful with copying production data to another machine. This can be considered a security/privacy vulnerability in the eyes of people who care about things like ISO27001 (which is everyone if you’re selling at the enterprise level).

With a sufficiently comprehensive data anonymisation tool you can work around this limitation, but that's an additional investment that a business needs to be aware of up front.


FWIW this should be the standard practice anyway.

Your dev environment should be running on a sanitised version of prod data.

Some people prefer to operate on a subset (10%~) of prod data, I have always preferred to take the cost of taking everything.

This has a nice consequence of allowing you to test automated restores at regular intervals, but you must be mindful of the "fail-open" nature of sanitising data. (IE: new data must be known to the sanitisation tool).

Pipeline usually goes:

Prod/live -> Replica -> Backup/Snapshot -> Copy to archive -> Copy from archive via streamed sanitiser -> sanitised backup -> Restore[n] Dev Envs from same backup.


Taking everything? I'll call you in 2 weeks when the indexes are done building


I am finding it difficult not to reply with snark, because I'm quite sure that 2 weeks of downtime to restore your systems in a data corruption or complete failover scenario is not reasonable to your directors.

But, even that said: you can copy the binary files over to a new machine (copy-on-write snapshots -> rsync) -> store a copy -> start up the database, sanitise -> ship around to dev envs.


What happens when prod is a few hundred terabytes or you use logical replication to stream changes to handle major version updates? The GP’s point was shipping 100% of a large database isn’t feasible.

You’re conflating dev environments with restoring backups. Those can be the same thing but are often separate.


If your prod environment is hundreds of terabytes then making good dev environments is even more crucial and you can’t run things locally.

If you’re running hundreds of terabytes then the systems in place to shard that data must be well tested.

Migrations must happen on similarly sized data, along with various distributed transaction guarantees because I doubt you’re going to be using dedicated-attached storage for that. And if you do then testing multipath needs to be part of your testing too.

Is it expensive? Yes. But that’s what working with that amount of data costs.

Or is this a strawman intended to stump me, because I have dealt with such “data requirements” before and when they saw the sticker price of doing things properly suddenly those hundreds of terabytes weren’t as “required” anymore.


You can also do zero copy clones of production in Aurora, Snowflake, etc, so you don't have to duplicate the whole thing.


Do you have recommendations for guides and tools to automatically get a sanitized subset of prod data of Postgres for development?

I haven't looked into it in a while and the last time we ended up rolling our own.


I don't think there are ready made tools, it's usually custom in every environment I’ve worked in.

Any tool for doing this would have to be so generalised as to be extremely difficult to configure I believe (as difficult maybe as setting it up with custom shell scripts)


you could have a look at dblab[1] which (afaik, I've not yet tried using them) has some support for streaming in from a primary source and applying sanitisation functions/transforms

The main value is the use of ZFS snapshots to give you almost-instant (2-3s for a 20G DB on my dev laptop) writeable clone of an import, which you can test your migrations etc against, and then just revert or destroy, which has been extremely helpful for me.

Happy user, no relationship, etc.

[1] https://postgres.ai/products/how-it-works


Great advice. One issue with testing: even once you have a copy of the production DB, it's not actually a live production DB in that it's not being modified and used continuously in the same way. This can hide issues with your change.

My point is, testing on a copy is necessary, but you must combine it with thinking as carefully and thoroughly as you can about the potential side-effects, which in turn depends on your understanding of the application architecture. Some of those side-effects you can test for, others are not so obvious unless you can better simulate the production environment with noisy users who will complain when things go wrong. One strategy is to find people internally who are good at being real life simulators of noisy users who bump into edge cases.


If it's a critical problem that affects all users, then yes, a small number of trusted people can execute raw SQL updates on the DB, while being supervised by another trusted person.

In other cases, you write a small job called a fixtool, which goes through the normal code review process, then gets deployed once, runs, and gets deleted after it fixes the situation.


Database migration is just code. You create your own database migration tool with an incremental version number and you apply it during deployment.

You can also use a specialized tool. https://datacadamia.com/data/database/migration#tool

Make a branch, test your code and deploy it.


Why would you create your own database migration tool instead of using one of the many existing and bulletproofed tools already?


The bulletproofed tools break left and right and often create migrations that result in downtime. I haven't used Liquibase, but Flyway frequently breaks and migrations often involve downtime, though I'm sure that can be worked around by carefully creating the changes.

The best solution I've seen is to write the migrations by hand, including rollback scripts. This goes for both schema changes and data changes. Schema changes should, in my opinion, be done out of sync with code changes. The new schema is applied first, and once that has run for a few days, code can be updated to use the new schema. This allows for easier schema and code rollbacks.


> Flyway frequently breaks and migrations often involves downtime

Just as a counterpoint: In my experience with using Flyway in a professional context with big Postgres instances for the past 6 years we never had a single issue with it.


A migration tool doesn't necessarily imply auto generated migrations. Ruby on Rails for example only generates the base file, but you still need to tell it what the migration is supposed to do.


I've yet to find one that will work for my scenario, where I have multiple database systems and (ugh) elasticsearch mappings in play. The tools I have found are all designed for a single system or are so generic I had to write the bulk of the code, anyway.


If your company suffers heavily from not invented here syndrome, then sure. But just seems like a waste of time when many tools exist.


I'm not sure which question you're asking:

Is this about processes as a small team grows into a larger company? At a certain point your day-to-day software engineers will lose access to production systems. You will need to take away "ops" from the software engineers and have a dedicated "ops" team. Some places call the "ops" team "devops" for political reasons, especially if some founder has a chip on their shoulder about not having dedicated "ops." (Software engineers not touching production systems is industry-normal security practice.)

Is this about how to do the database migration safely? That really depends on your stack, business type, and scalability needs. Assuming you aren't running a hyper-scale product, and you're on a normal database, the easiest thing to do is to have "planned downtime" at a time when your load is lowest. Your business logic layer should return some kind of 5xx error during this period, and your clients / UI should be smart enough to retry after 1-2 minutes. If it's a minor update, (plenty of good advice in this thread,) the downtime should only be 1-2 minutes tops. (The only reason to plan and notify your customers is in case someone is trying to do something business critical and complains.) One thing you can do is "let it be known" that you have 5 minutes of planned downtime Monday night and Thursday night, and that's your window for anything that isn't an emergency.

Is this about the frequency of updates? This is a quality problem, and the only thing you can do is improve your testing and release process to catch these bugs sooner. This is the "growing up" that all small tech companies go through. As you grow, make sure to bring in some people who are mid-late career who've been through these changes. In short, you will need to introduce processes that catch bugs sooner, like automated testing and code coverage. You may find that your "test engineers" write nearly as much test code as your software engineers put into the shipping product.


It's not just hyper-scale products that preclude having nice planned maintenance windows. Serving a handful of countries spread out across the globe is enough to throw the concept out the window. I had an application that started out US-only, and was built assuming that 11pm-7am Eastern was available to do big data imports, run reports, etc. But once a couple other countries got added, there was no longer any time of the day or night where that assumption held true. A lot of processes had to be re-architected because of that.


We are a .NET shop and have been using DbUp (https://dbup.readthedocs.io) for years now. It is fantastic. We had so many false starts managing production databases and keeping them consistent. DbUp has taken all that away, and now every upgrade or patch is repeatable and predictable. It is worth spending time identifying what in your database is idempotent (e.g. stored procs, functions) and what isn't. This way the idempotent objects can be upgraded every time and everything else just receives a single incremental update as required.
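As an illustration of that split (file names and layout are mine, not something DbUp prescribes): the idempotent objects are written so they can be re-applied on every deploy, while table changes stay in one-shot incremental scripts:

    -- Everytime/usp_GetOpenOrders.sql  (safe to re-run on every upgrade)
    CREATE OR ALTER PROCEDURE dbo.usp_GetOpenOrders AS
    BEGIN
        SELECT OrderId, CustomerId FROM dbo.Orders WHERE ClosedAt IS NULL;
    END
    GO

    -- 0042_AddClosedAtToOrders.sql  (runs exactly once; the journal table records it)
    ALTER TABLE dbo.Orders ADD ClosedAt datetime2 NULL;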


If you want to change the data stored in production database tables, following the below rules will be helpful.

1. Please don’t issue I/U/D SQL statements directly. You’d better write up some programs acting as the “revert transactions” to do the data modification. In this way, you don’t need to grant the I/U/D privileges of the production tables to any user IDs. Instead, you give access to those programs. It will help you to remove the human error as much as possible. And the behavior of the programs is more predictable and consistent.

2. You should have a fixed change window for these kinds of data changes. You should not execute those “revert transactions” whenever you want.

3. Then you give execute access on those programs only to the people who need to do the work, and only during the change window. That is, you grant the execute access to the user ID beforehand and revoke it afterward (a rough sketch follows at the end of this comment). Since this is a grant/revoke between a user ID and programs, it’s much safer. If you have to grant/revoke between a user ID and tables, there might be cascade effects.

4. Before the change, capture the “before” data and get ready the fallback script.

5. Don’t forget to turn on the audit options/tools during the change window.

6. If you guys work in a physical office, you can think about binding those revert transactions to a dedicated desktop.

I know these rules are complicated and tedious, but they could protect the team and the data as well :)
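A rough sketch of rule 3, assuming the "revert transaction" is packaged as a stored procedure (Postgres-flavoured, names invented):

    -- Start of the change window:
    GRANT EXECUTE ON PROCEDURE fix_duplicate_invoices() TO change_operator;

    -- The operator runs the pre-reviewed program, never raw I/U/D statements:
    CALL fix_duplicate_invoices();

    -- End of the change window:
    REVOKE EXECUTE ON PROCEDURE fix_duplicate_invoices() FROM change_operator;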


Typically I lean toward rake tasks (Rails dev). These have specs and go through the PR process just like everything else. In the worst case we jump on directly, but each Rails instruction or DB instruction is logged against the dev who ran it (in a Slack channel that anyone can see, in real time).


Honestly the tooling in Rails for this is the thing I miss most now I don't get to work in it any more. Makes me wonder how anyone gets anything done without it.


The first idea I had was "don't". We have a VM where we do all of our changes and then the actual production server is just a copy of the VM. So we test things on the VM and if it doesn't break, THEN those same changes are made on the production server.


> The common way I know is the access is given to a small number of trusted people who make the updates.

I think this was a common way of doing it 10+ years ago, but what I've seen work the best is to have your CI/CD process apply the updates. Assuming the migration scripts follow the same process as all your other code (code reviewed, tested, and are runnable), automatically applying these scripts is much less error prone than having a person apply them. Humans are pretty terrible at following a script, which is one of the things computers are great at.

Some objections I've seen to relying on automatic deployments:

- "But we should have a person run them to make sure they do they right thing!" Agreed, you should, but that should happen at test time, or maybe code review time, not right before you are about to apply something into production the Shift Left [1] principle applies to database changes as well as code changes

- "What if something goes wrong?" Well, what if something goes wrong with any of your deployments? Usually, you can roll back (either literally rolling back, or rolling forward, or rolling in some other direction that leaves your system in an ok state), look at the logs to see what went wrong, and try again.

- "But data can get messed up!" Data can get messed up by code changes. You should have a backup/disaster recovery process.

[1] https://en.wikipedia.org/wiki/Shift-left_testing


I use web frameworks that have database migration tools and then it becomes just another code deploy which gets kicked off by CI.

In development I create the migration which ends up being a file that alters the schema. I run the migration command and make sure things work as planned and all tests still pass.

Then I push up the feature branch and it gets code reviewed. This review could either be by yourself or a team member depending on team size.

After everything passes on CI it gets merged and deployed. At this point it depends on how you deploy things but that could be to a staging environment where it runs which gives you extra confidence that it works before rolling it out to production or you could go straight to production if you don't have a staging environment.

As for how it gets run, usually it happens after the new version of your code is available to be pulled but before your web app gets restarted. This way it can run while the old version of your app is running, then once it completes you can restart your web app to pick up the new code base that was part of the deploy.

If you're careful about how you migrate things you can do a lot of migrations without hard downtime, but it's usually a balancing act between how complicated you want the migration process to be (multiple steps / deploys) vs how much downtime you can live with (1 step / deploy).

Basically migrations are code that get treated like any other code deploy with a proper review. If it's an especially important migration it wouldn't hurt to get it reviewed by more than 1 person and also perform a production DB backup right before you deploy the migration.


Seconded.

I've used Django migrations. From the dev side it's super easy: some code PRs also have associated migration code. A deploy will run the migration then change the application code. It has sharp edges but works 97% of the time. The framework also generates rollbacks, for easy removal of problematic migrations.


This feels very timely, as I just encountered it yesterday.

The steps I took were:

1. backup production before making any changes

2. import the production backup locally

3. test the script locally, QA the result to ensure desired outcome

4. repeat 2-3 until you feel confident that you're able to consistently get what you need from the script

5. wait until a low volume time for your platform (if possible), run a backup, import and test 1 more time locally, apply the change in production

However, after doing this, a coworker noticed something we had missed when QAing the data. As a result I had to revert the changes; I had a rollback query ready to restore the modified records.

As part of the planning for this, I added a new column to the table in question to track if/when a fix was made to the record. This allowed me to identify the records which were modified to roll them back. While this isn't always practical, it made sense long term for our situation.
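Roughly, with invented names:

    ALTER TABLE subscriptions ADD COLUMN fixed_at timestamp NULL;

    -- The fix stamps every row it touches:
    UPDATE subscriptions
       SET plan = 'basic', fixed_at = now()
     WHERE plan IS NULL;

    -- The pre-written rollback only targets the stamped rows:
    UPDATE subscriptions
       SET plan = NULL, fixed_at = NULL
     WHERE fixed_at IS NOT NULL;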

A secondary option I've used in the past is exporting the table to be updated and then writing a simple script with the modifications necessary to generate the SQL to update each record individually. I've then used the same script and export to build a revert update query as well just in case something unexpected comes up.

At the end of the day, make sure you're backing up before any changes, plan ahead in case of unexpected errors so that you can revert quickly to avoid downtime instead of scrambling when it goes wrong. Test it a few times locally (change + revert) to try and sort out the issues prior to applying them in production.

edit: formatting


Create a history table that logs each change with a timestamp and user name. The latest timestamp is the current revision. That way you know who last changed the record and to what data. It worked for me as a federal contractor.
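A minimal sketch of that idea (Postgres-flavoured, names illustrative):

    -- Append-only: the row with the latest changed_at is the current revision.
    CREATE TABLE customer_history (
        customer_id int       NOT NULL,
        changed_at  timestamp NOT NULL DEFAULT now(),
        changed_by  text      NOT NULL DEFAULT current_user,
        data        jsonb     NOT NULL,
        PRIMARY KEY (customer_id, changed_at)
    );

    -- A "fix" is just another insert; nothing is ever overwritten.
    INSERT INTO customer_history (customer_id, data)
        VALUES (42, '{"email": "corrected@example.com"}');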


At a previous employer, we had a 'fire call' to handle these situations. These were emergency situations and required a form submission and authorization from a manager. You were given temporary credentials with access to do what you needed to do.

I only had 2 of these in the 5 years I worked there, but here's an example. We had an internal purchase request system used for purchasing a keyboard or whatever you needed for work. Of course the purchase request went through a chain of approvals starting with your manager and ending with the CTO. The CTO threw a fit for having to approve keyboards and other small items, so it was deemed an 'emergency' to fix it right away. I had to immediately patch the code so he wouldn't see trivial requests. The 'fire call' allowed me to submit the code directly into production without going through the change control procedures, which only happened once per week.

And you better be damn sure that your changes are correct, crap rolls down hill very quickly when it involves very senior people.


> And you better be damn sure that your changes are correct, crap rolls down hill very quickly when it involves very senior people.

That's an organizational problem though. We have a culture where, given a choice between "let it burn" and "quick fix with potential to blow it up (rather than put out the fire)" - responsibility for fallout in case it blows up lies with the person making the decision (the senior manager/CTO) - not the person doing the work and describing the trade-off(s) (this should work - but there's a non-trivial chance it can blow up).

That said, work for fixing the aftermath does flow down hill.. It has to.


Oops I accidentally set the threshold to $10k. :)


I work with a 'nosql' database vendor and I take care of customers with PBs of data doing this. A very common pattern is having a DR setup - i.e. two separate clusters, with the database replicated, then breaking that replication, upgrading DR and if it's successful promoting that to acting prod and upgrading production (cluster #1) and re-enabling replication.


On my infrastructure the number of people with direct access to modify the database is zero.

What you do is you write any operations that are to be run on production as code (for example service that can be triggered).

You then get the code reviewed (second pair of eyes), after which it is automatically compiled and delivered to pre-production. On pre-production you have a chance to test it against the full environment to see if it is breaking anything. If the test results are accepted, it can be promoted to prod.

We actually have special service that is only meant to run one-off database operations. It is a webservice where you can trigger jobs.

Jobs do not take any parameters at all -- you just give the name of the job and that's it. This allows the entire operation to be reviewed in full.

Jobs can't run in parallel -- there is a queue and you can schedule a job to run. The job can also be saving progress (and it will restore from the saved progress in case it fails or is cancelled mid-operation).


This is a nice article from the folks at Octopus Deploy on how to solve this problem with their RunBooks feature:

https://octopus.com/blog/database-deployment-automation-adho...


Gated automated deployment (for MSSQL - dacpac via Azure DevOps, similar things exist for other DBs). Usually can be pre-tested if prod db copy is restored on lower environments. If anything, rollback can be made to previous deployment.

Either way, you need to designate someone who is knowledgeable to oversee the process. Automated deployment just makes this process way faster (thus "reducing bottleneck") where this person only needs to review code to see if anything suspicious isn't there and approve the deployment.

Manual deployments are prone to human error, especially under stress/time pressure. If manual deployment "package" (a set of sql scripts) is poorly written, there's huge incentive to "contact directly", which again could lead to manual errors.

The biggest drawback is culture, which is the hardest part ("we've done it this way for n years and we don't want to change").


Greatly depends on the database in question. For MySQL I heavily recommend using pt-online-schema-change[1] from Percona Toolkit[2].

In all cases though you should have a change management process where the schema change is documented, reviewed, and approved and only a small number of people have the necessary access to run these types of changes in production. Change management is your friend.

1: https://www.percona.com/doc/percona-toolkit/3.0/pt-online-sc...

2: https://www.percona.com/software/database-tools/percona-tool...


"Pair programming" - don't do it alone, have a second person check every letter typed. Do explicit backups beforehand. And for non-DDL updates, do it inside a transaction where you can verify the results (preferably by running multiple automated reports) before committing.


Very carefully. :)

Migrations are a good place to store code for an update. If it’s an update that will take a while I might use the migration to enqueue a job.


Use your standard DB migration tooling, with code/change review as part of the process.

I’m a big fan of sqitch[1] but many migration tools will handle data updates as well as DDL.

1: https://sqitch.org/


At current and last job we trust all devs with production DB access, but it's logged any time they access a machine that has prod DB access. We also have a migration framework that can create an audit log that can then be used to perform or rollback the migration. It works at the row/object level. There is a review system around migrations and you have to pass your ticket ID to run the migration, and it will check if the ticket was approved.

Former job eventually had a system of request - you would request access for your user for 24 hours.

At the job before that, only managers had prod DB access. Do not recommend if you can avoid it.


The best migration is no-migration. That is not having to move data at all. This becomes increasingly more important with larger datasets. One way of doing that is to simply serialize data in something like Protobuf and have a generic schema that rarely/never needs to be altered[1]. Adding a new field to the Protobuf is a no-op.
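In relational terms that tends to collapse into something like this (a sketch, assuming one serialized Protobuf blob per entity):

    -- The schema almost never changes; adding a field to the Protobuf is a no-op here.
    CREATE TABLE things (
        id         bigserial PRIMARY KEY,
        thing_type text      NOT NULL,
        payload    bytea     NOT NULL,   -- serialized Protobuf message
        updated_at timestamp NOT NULL DEFAULT now()
    );

The trade-off is that the database can no longer index or query the fields inside the payload.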

[1] https://kevin.burke.dev/kevin/reddits-database-has-two-table...


Assuming you’re talking about patch data and not schema migration, the most effective way to me is direct write access to the database, and transaction usage.

Whenever I need to patch data on a Postgres, the first thing I type is `begin;` to start my transaction. After that I just run my updates safely and check that the impacted data matches what I expected. Then I just need to commit or roll back the changes.

2 things to do before that:

- have the queries reviewed by members of the team knowledgeable about the db

- try the query on a dev environment (ideally a preprod with a replica of the data)
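A typical session then looks something like this (made-up table, same pattern):

    BEGIN;

    UPDATE accounts SET locked = false
     WHERE locked AND last_failed_login < now() - interval '1 day';
    -- psql echoes the affected row count here; check it against expectations.

    SELECT count(*) FROM accounts WHERE locked;

    COMMIT;   -- or ROLLBACK; if anything looks off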


If you mean updates to the schema then some sort of migration system surely exists for your framework of choice, such as in Rails and Laravel too I think.

If you mean one-off changes to existing data what has served me well so far is implementing an audit log from the start coupled with making sure that records are fudgeable by a select few admins.

This means that a few people can make changes to nearly everything from the interface rather than having to log into the DB directly. At the same time there is a record of changes done and the ability to comment on and revert them.


If using SQLAlchemy, of course the answer is Alembic[0]. But I like Alembic so much that actually, in a couple cases, I copied the model into Python/SQLAlchemy only to be able to use Alembic for migrations. Of course if you use some other ORM instead, that's a no go, but for dealing with legacy/raw DBs, it's worth the effort to have code do it for you.

[0] https://alembic.sqlalchemy.org/en/latest/


> The common way I know is the access is given to a small number of trusted people who make the updates.

We have the same approach. A very small number of people have write access to the production databases. If things can’t wait for a schema change release (15-30 min) and can’t be done through the back office API, we do it manually. It’s very rare.


I use go migrate. It's basically a tool that has a version number associated with a SQL file. You write an up script and a down script in SQL and commit them to source control, and you use the tool to deploy. You get a lot for very little extra.
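If this is the golang-migrate tool, the layout is just a pair of plain SQL files per version (a hypothetical example):

    -- 000007_add_deleted_at_to_users.up.sql
    ALTER TABLE users ADD COLUMN deleted_at timestamp NULL;

    -- 000007_add_deleted_at_to_users.down.sql
    ALTER TABLE users DROP COLUMN deleted_at;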


Similar procedure, limited amount of people.

It should be a rare exception not a normal thing, so a few people is fine.

Best practice is to also have your change reviewed beforehand, and run in a transaction where you also validate expected results before committing, etc.


Yep, there should be a rollback plan, a change review process prior to running it, and successful tests on a test system (which should be identical to prod).


At work, we do the following:

* User reports the issue to a business analyst via a ticket.

* Business analyst confirms the issue, adds any relevant details, and opens a ticket with the technical team.

* The technical team creates a script to perform the update and runs it in the 'support' database (which is refreshed nightly from Production)

* Business analyst validates results

* User validates the results

* Technical team logs a Change Request, it is discussed at the next Change Advisory Board meeting, then the script is run during the next change window (weekly). If it's an emergency, we can run it the same day.

* The Database Administration team runs the script in Production

It sounds like a heavy weight process, but with multiple people involved and actions logged at every step of the way it's a very safe and auditable process.

TL;DR: A small number of trusted people have the required access, there is a fair amount of business process built around using this access.


If this is becoming a bottleneck, there's a process problem, and you need to invest in an automation mechanism. It should be rare.


If you are running .NET, developers would utilize Entity Framework migrations to manage and source control all DB changes. Similarly, Alembic for the Python ecosystem. Treat the application of migration scripts as part of a regular prod deployment process, which is typically limited to a small group of ops staff.


EntityFramework Migrations or DbUp (someone mentioned it here too).

https://dbup.readthedocs.io/en/latest/


Unfortunately the old-fashioned way: do it out of hours with a backup first :(

I've never had to work on a 24/7 critical system so I dunno how to do that.


There is a `fixes/` directory in source control, full of lots of little one-off things that have a ticket number in the filename.


I just set up Basedash yesterday (think Airtable over real SQL databases) and plan to use it for updating a production database.


I would build a rest service that provides the code to do this, and I would test it extensively before releasing it as the mechanism to effect the change.

In addition I would build a validation service that checks and tests the update data (the list of changes) before they are submitted to the change service.

I would not permit any adhoc changes to the prod database. The rest service should be the only way for a mutation to be done short of a complete new release.


So a month's worth of engineering effort for something that you would use once a month?

If you need to upgrade production stuff more than that you have problems elsewhere.


A month of effort to avoid a day of downtime, and six months of unemployment.

I worked in a business where if the website went down we missed £26m revenue. This is not a particularly big business in the modern sense. However, one MD had a screwed up migration that didn't have a good roll-back plan. There was a missed day of revenue, and he - and several others in his top team - left the business within a calendar month.

This kind of thing colours my view on prod risk.


I use copy, drop, create, restore to keep the db schema up to date.


Know SQL well, so you know how to revert any problem you have.

Other commenters say a lot about PREVENTING an issue. That is good, but you can get stuck in a situation (very common, sadly) where you are under a pile of "stuff" that hides a simple fact:

You CAN ruin and revert nearly everything* in an RDBMS and still come out on top.

* as long as you are inside SQL/transactions and not messing with the binary data

---

The point here is that you can model the database in ways that leave you stuck with limited options, or instead make it easier to get out of trouble.

The good thing is that this is mostly a matter of applying good design practices and is not that problematic.

The key is that data in an RDBMS IS values (i.e. immutable rows and tables) on top of a mutable facade (INSERTs, DELETEs), so you can always take a snapshot of data and apply reversible operations.

This is what most people do with "copy to SQL, CSV, do a backup, etc.", but it is in fact more built-in than that. You can totally side-step extra operational complexity if you know this fact (and the change is kinda small; I don't say don't make copies or backups, but rather that SQL is VERY friendly to mess with!).

The major issue is when you turn your powerful RDBMS into a pathetic datastore, with a lot of MongoDB-style data/schemas, anti-patterns everywhere and close-to-zero understanding of (advanced) SQL: usage of views, indexes, triggers, TEMP tables (note this!), administrative commands, writing your own functions, knowing what tools the db gives you for free, etc.

It sounds like a lot, but even a "complex" RDBMS like Postgres requires far less study than learning JS. It's just a matter of understanding that RDBMSs are not "dumb" datastores.

---

I forgot to make this actionable. Most RDBMSs have a comparable way to do this:

    CREATE TEMP TABLE _address ON COMMIT DROP AS
     SELECT customer_code, address_code, is_default FROM address;
Do the changes there. Your temp tables are "FREE" to get ruined.

Then apply it back with an update/insert from the temp table:

    UPDATE address
    SET ...
    FROM _address
    WHERE address.address_code = _address.address_code;
You can ALSO copy to another schema (PG) or db (sqlite) and do the mess there:

    CREATE SCHEMA fix;

    -- a regular table here: temp tables live in their own schema
    CREATE TABLE fix.address AS
     SELECT * FROM address;

    --or
    CREATE DATABASE fix WITH TEMPLATE old;
And this means you can also "rename" the old db/schema, do a copy of it, then see if the copy was OK; if not, drop the schema and rename again.

This is how I have done some of the most nerve-wracking fixes!



