Ask HN: Patterns for deploying webapp updates with no downtime
60 points by simonw on April 14, 2010 | 19 comments
I'm interested in techniques that can be used for deploying new versions of web applications with no perceived downtime for end users, without having to disable writes.

I think I know how to do this while disabling writes: run two copies of the database (one replicated from the other), disable writes at the application level, separate the slave and continue to serve reads from the master, upgrade the slave's schema, "activate" the slave (essentially telling it it's now a master), point a new instance of the application, running the updated code, at it and switch the HTTP traffic over - then set the original master up as a slave to the new master and enable writes again.
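
For reference, here's roughly what I mean on the database side - a minimal sketch only, assuming a MySQL master/slave pair and the mysql-connector-python driver; the hosts, credentials and surrounding orchestration are placeholders:

    # sketch of the slave-promotion step described above
    import mysql.connector

    def promote_slave(host, user, password):
        conn = mysql.connector.connect(host=host, user=user, password=password)
        cur = conn.cursor()
        cur.execute("STOP SLAVE")    # stop replicating from the old master
        cur.execute("RESET SLAVE")   # forget the old replication coordinates
        # ...apply the schema migrations here, then point the app servers
        # running the new code at this host and flip the load balancer...
        conn.close()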

First question: is this sane / best practice?

Second question: if I want to do this without disabling writes for the period of the upgrade, what are my options?

Plenty of sites seem to manage to deploy new features without noticeable periods of downtime or read-only mode, so presumably there are a bunch of patterns for dealing with this. Where can I learn about them?

To clarify: I'm talking about updates that include some kind of modification to the database schema.




It sounds like you're making rolling updates across your app server cluster, version n -> n + 1. You have to separate database updates into innocuous and harmful, and your developers have to signal that state for deploy.

Changes:

a) Schema changes and row updates that are compatible with 'n'. No downtime, no worries.

b) Schema changes and row updates that are _in_compatible with 'n'. Ideally this requires downtime, but I've seen architectures that get away with live rolls by grouping app server updates by shard.

c) Database changes that will have a severe performance impact, e.g. index/update a massive table or hit some other perf corner of your db. Downtime or you invest in a key-value or FriendFeed-style architecture.

Most agile updates tend to be (a), luckily.
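
To make the (a)/(b) split concrete, a toy illustration against an in-memory SQLite database (the table and column names are made up):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # (a) additive change: version n code never mentions the new column, so it keeps working
    db.execute("ALTER TABLE users ADD COLUMN email TEXT")

    # (b) incompatible change: version n code still selecting "name" breaks mid-roll
    # db.execute("ALTER TABLE users RENAME COLUMN name TO full_name")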


After reading http://highscalability.com/ for a while I've happily found myself using some tips from there from time to time. I've only used these methods a few times; I usually just push large database-changing deploys at night and try not to do anything that takes longer than 20 minutes.

One way is to use two tables and have the application logic read from both and write to the new one. This can't always work easily, but for tables that don't get a lot of joins it's not that hard. Deploy the code, migrate the data into the new table and then drop the old table. I've used this once to update a user_profile table for a busy forum.
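
The transition period looks something like this - a sketch with a plain DB-API cursor and invented table/column names, not the exact code we ran:

    def save_profile(cur, user_id, bio):
        # writes only go to the new table from now on
        cur.execute("INSERT INTO user_profile_v2 (user_id, bio) VALUES (%s, %s)",
                    (user_id, bio))

    def load_profile(cur, user_id):
        cur.execute("SELECT bio FROM user_profile_v2 WHERE user_id = %s", (user_id,))
        row = cur.fetchone()
        if row is None:  # not migrated yet, fall back to the old table
            cur.execute("SELECT bio FROM user_profile WHERE user_id = %s", (user_id,))
            row = cur.fetchone()
        return row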

Another way to mitigate downtime on table changes is to have lots of tables. I believe one of the larger Chinese social networks was reviewed on the High Scalability blog and boasted that they found it easier not to have more than two columns on a table (pk and value). That's a little crazy imho, but I have seen it working with smaller column groups.

You use a lot of 1 to 1 relations and each column or logical group of columns gets its own table with a foreign key to the main object. This way you can modify a column without restricting access to most of the object, at the cost of more joins. I worked on a Django project where we had a users table and any user information (there was a lot) was a different table. The data models were related to the user model and handled all the lookups. (User.profile, User.contact, User.reporting_preferences, User.support_requests, etc.)
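
In Django terms it looked roughly like this (the model and field names here are invented, not the real project):

    from django.db import models
    from django.contrib.auth.models import User

    class Profile(models.Model):
        user = models.OneToOneField(User, on_delete=models.CASCADE)
        bio = models.TextField(blank=True)

    class ContactInfo(models.Model):
        user = models.OneToOneField(User, on_delete=models.CASCADE)
        phone = models.CharField(max_length=32, blank=True)

    # altering ContactInfo only locks its own table; Profile and User stay readable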

I've never used mongodb or couch, but with a NoSQL store you can just have the app logic take care of upgrading records on read. Run a script to upgrade everything, then drop the upgrade logic from the app.
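
The upgrade-on-read piece might look something like this (pymongo-flavoured sketch; the field names and the migration itself are made up):

    CURRENT_VERSION = 2

    def upgrade(doc):
        if doc.get("_version", 1) < CURRENT_VERSION:
            # v1 stored a single "name"; v2 splits it into first/last
            first, _, last = doc.pop("name", "").partition(" ")
            doc["first_name"], doc["last_name"] = first, last
            doc["_version"] = CURRENT_VERSION
        return doc

    def load_user(collection, user_id):
        doc = collection.find_one({"_id": user_id})
        return upgrade(doc) if doc is not None else None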


What we're doing is this: upgrades that actually change the data model go in two phases:

* First, an upgrade that understands the old model and the new model, internally uses the new model, and writes in the old model. This means that this new version is 100% compatible with the old version. We launch new services, test them, add them to the load balancer, and remove the old services from the load balancer.

* Secondly, a new update is launched: this one is almost the same as the previous version, except that it writes its data in the new model too. The same process with launching new services and adding to the load balancer is repeated.

Using this two-phase upgrade has the major advantage that you're always running the new services next to an old version that is completely compatible, data-model-wise, and thus allows you to do an emergency rollback to a previous version if required. The trick with adding to the load balancer also ensures that no downtime is experienced for the clients.
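
As an invented example of what actually changes between the two phases, say a full_name column is being split into first_name/last_name:

    def read_user(row):
        # both phases can read either shape of the row
        if "first_name" in row:
            return row["first_name"], row["last_name"]
        first, _, last = row["full_name"].partition(" ")
        return first, last

    def write_user_phase1(first, last):
        # phase 1: understands both models but still writes the old one,
        # so it can run side by side with the previous release
        return {"full_name": (first + " " + last).strip()}

    def write_user_phase2(first, last):
        # phase 2: same reads, but now writes the new model
        return {"first_name": first, "last_name": last}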

All this requires quite a bit of work (especially since you need to deploy multiple releases), so it depends on how much zero-downtime upgrades are worth to you.


Are you just trying to avoid it looking "bad" for visitors, or do you actually require your site to be up the whole time?

If the former, one hack is just to make the downtime more fun for users - I added a chat interface so that anyone waiting doesn't get too bored and can interact with other members.

Screenshot: http://lamby.uwcs.co.uk/b/playfire_maintenance.png

Architecturally, it doesn't touch our database or "main" site at all so we are free to break everything during an upgrade.


That's really cool. We're aiming for almost full functionality in this case, but I can see how that would work great for some projects.


I wrote up how we attack this problem a couple of years ago:

http://kylecordes.com/2007/01/20/web-app-swap/

including how we handle schema changes.


Great post. The idea is very clearly explained in a few paragraphs, and it's quite complete since several variations are considered, such as clustering, bookmarks and schema changes. Everything just makes a lot of sense.


Depends on what scale you're dealing with. If you have a high-traffic site, the db should be sharded, so if you do a manual master-slave switch, only a small piece would be affected at a given time.

But I'm guessing you're dealing with a single M-S setup. I've asked around and it seems the standard practice for that type of setup is to create a second table for each one you are attempting to modify, 'insert into table2 select * from table1;', modify table2, rename table1 out of the way, then rename table2 to table1. After that, script or manually cope with any 'leftovers' in the old table1 that didn't get ported to table2.
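
Spelled out, the sequence is roughly this (MySQL through mysql-connector-python; the table names and connection details are placeholders):

    import mysql.connector

    conn = mysql.connector.connect(host="db", user="app", password="secret", database="app")
    cur = conn.cursor()

    cur.execute("CREATE TABLE table2 LIKE table1")
    cur.execute("INSERT INTO table2 SELECT * FROM table1")
    # ...ALTER TABLE table2 here while table1 keeps serving traffic...
    cur.execute("RENAME TABLE table1 TO table1_old, table2 TO table1")
    # then reconcile any rows written to table1_old during the ALTER
    conn.commit()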

Would be interesting to have an in-memory (but written to disk) NoSQL layer sandwiched between MySQL and the user. Then, you can change schema all you want or switch in/out DB servers without any visible impact. Might be a leaky abstraction though. Not like I tried that out.


You know what, I never actually thought about doing a migration by having a duplicate of the tables running in the same database. That sounds like it could work really well - thanks for the tip.


There are also options like Erlang and Node.js where hot code-swapping is possible. Although having a second database is useful as a slave, of course, I don't think it is necessary to run two copies of the database just to redeploy.

Github just redeploys by killing and restarting Unicorn workers gradually. It's graceful, because any worker that is handling a connection won't be killed, so you won't get any dropped connections. http://github.com/blog/517-unicorn


I'm not hugely concerned about swapping out application logic, since with a bunch of application servers it's possible to pull some out of the pool, upgrade them, then use the HTTP load balancer to redirect all traffic to the servers running the updated code.

The big challenge is making changes to the database schema and co-ordinating the deployment of those changes with the switch over of traffic to the new application logic.


This may not be a perfectly generalized solution, but is it possible to structure your system such that when upgrading from version i to version i+1, version i of your app is compatible with both versions i and i+1 of your database and vice versa?

Say, for example, you're factoring a column out into its own table. Don't just drop the column; set up a trigger to synchronize the original column value with the corresponding value in your new table.

You could then finally drop the column in version i+2.
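
A runnable toy of the trigger idea, using SQLite in memory just to show the shape - on a real system it would be the equivalent CREATE TRIGGER on your production database, and all the names here are invented:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE user_emails (user_id INTEGER, address TEXT);

    -- keep the legacy users.email column in sync while version i still reads it
    CREATE TRIGGER sync_legacy_email AFTER INSERT ON user_emails
    BEGIN
      UPDATE users SET email = NEW.address WHERE id = NEW.user_id;
    END;
    """)

    db.execute("INSERT INTO users (id) VALUES (1)")
    db.execute("INSERT INTO user_emails VALUES (1, 'a@example.com')")
    print(db.execute("SELECT email FROM users WHERE id = 1").fetchone())  # ('a@example.com',)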

I seem to remember finding a book on database refactorings that covered this technique in more detail. It was online so you could try Googling for it.


I've got a pretty good setup going; most of the changes don't require any downtime at all. Adding a table or column rarely requires any downtime (the existing code knows nothing about the table/column and continues on its way) -- push the DB change first, then the code. Removing a table or column can work as well: push the code change first and then remove them.

For more grueling changes (those that require data conversion), I still take down the site. I script the changes and then take the site down, convert, deploy, bring the site back up. Smaller changes take only a few minutes, longer changes can take hours. However, the length of the downtime is inversely proportional to how frequently you need to do it.

Sometimes taking the site down is appropriate. For really big changes, users simply cannot continue to use the site and be unaffected.


It's a pretty special app that can't handle a few seconds of downtime. The first thing I would do is be very certain this is a requirement.

I thought it was a requirement for a couple of webapps I manage, and I now think otherwise. I have scripts for starting and stopping various server processes and other scripts that pull them together to do a full deploy like what you are talking about. I could optimize it, but after doing it this way for a bit I realized I feel safer keeping it simple. I'm fairly certain I've never had a quality-of-service problem with my users.


Not sure if this is exactly what you're getting at, but I use Capistrano. It was built for Rails deployments and does require some scripting/setup, but once you've got that down I can push changes to any of my sites all day long. I have a few WordPress installs that I deploy with Capistrano as well. Once you've got everything set up, there is no noticeable downtime.

http://www.capify.org ... to modify the database schema you can put migration logic in your scripts.


For simple things like adding tables or adding columns, just do it. Add the column/table, then release new app code relying on it.

For more complex things (changing the name of an existing column, or breaking a table into two parts, etc.) you need to write a compatibility mode into the application code. New writes go to the new column/table name, reads go to both places. Once that's released, migrate all the data as slowly as you like behind the scenes. When you're done, you can drop the old column or table.
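
The "migrate slowly" step can then be a dumb throttled loop, since all fresh writes already land in the new place (column names and batch size below are made up):

    import time

    def backfill(conn, cur, batch=1000):
        while True:
            cur.execute(
                "UPDATE users SET display_name = legacy_name "
                "WHERE display_name IS NULL AND legacy_name IS NOT NULL "
                "LIMIT %s", (batch,))
            conn.commit()
            if cur.rowcount == 0:
                break
            time.sleep(0.5)  # throttle so the backfill never hogs the database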


This is not sane practice. There's a technology trying to address this without you having to modify your web application at all:

http://chronicdb.com

The idea is to allow both the old version of the web application and the new version to work concurrently, with no errors.


Can you change your database paradigm? ¬_¬ A document-oriented database like CouchDB would "just work" in the most common database schema changes. Or perhaps you could throw upgrade-friendly data in a KV store encoded with Google Protocol Buffers.


Are you working for vendor X on the Sprint.com upgrade? 'cuz they seem to be having this precise problem lately. Down since 2300 on Saturday with no end in sight.



