Your service should be able to cope with a failure. Just ensure your updates don...

yjftsjthsd-h · on Dec 31, 2020

Okay, I have a server running the database (postgres) that backs my user-facing apps. How exactly shall I run kernel/glibc/systemd/postgres updates without customer-facing outages?

EDIT: To be clear, I'm mostly arguing that you're wrong, but if you have a solution then that'd be great, because I dislike the difficulty in patching these things.

midasuni · on Dec 31, 2020

It's been a while since I’ve needed to run a critical database, but the options were generally master-slave with replication, or an active-active cluster.

yjftsjthsd-h · on Dec 31, 2020

Yes, we're currently doing primary-replica but that still requires downtime while we do a failover.

midasuni · on Jan 1, 2021

I remember building a system about 15 years ago which automatically failed over in under a second - the application connected to the most recent one and set up replication to the other one. If the master went down then the slave was promoted to master and when the original master came back it was configured as slave.
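A rough sketch of that promotion logic, in case it helps. This is not the system I described, just a toy model of the rule "promote the slave when the master dies, and let the old master rejoin as a slave". The `Node` class and `reconcile` function are made-up names, and health is a plain flag here rather than a real check against the database.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one database server."""
    name: str
    role: str            # "master" or "slave"
    healthy: bool = True


def reconcile(nodes):
    """Return the node that should take writes, promoting if needed.

    Meant to be called on every health-check tick.
    """
    current = next((n for n in nodes if n.role == "master"), None)
    if current is not None and current.healthy:
        return current  # nothing to do

    # Master is missing or down: look for a healthy node to promote.
    candidate = next(
        (n for n in nodes if n.healthy and n is not current), None
    )
    if candidate is None:
        return None  # everything is down, nothing to promote

    if current is not None:
        # The old master rejoins as a slave whenever it comes back.
        current.role = "slave"
    candidate.role = "master"
    return candidate


a = Node("a", "master")
b = Node("b", "slave")

a.healthy = False            # master goes down
active = reconcile([a, b])   # b is promoted

a.healthy = True             # old master comes back
still_active = reconcile([a, b])  # b stays master, a stays slave
```

A real setup also has to stop the old master from accepting writes before the promotion, and deal with any transactions the slave had not yet received, which this sketch ignores.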

Looks like https://www.citusdata.com/blog/2019/05/30/introducing-pg-aut... will sort out automated Postgres failover for you.

I certainly don’t have auto updates on all my servers (it wouldn’t be good if a TV program going out to millions of people suddenly vanished, and there’s always a glitch when it fails over). I do have them on public-facing servers though, so in your case I’d have them on the web servers but not on the database.

tutfbhuf · on Dec 31, 2020

Some people only have one server for a particular use case.

And servers might not always break in predictable/alertable ways.

midasuni · on Dec 31, 2020

If your service is only provided by one machine then you have to accept it will occasionally be out of use.

If the only monitoring you have of the service is users telling you it's down, then that’s not a problem you can avoid by not updating.

tutfbhuf · on Dec 31, 2020

> If your service is only provided by one machine then you have to accept it will occasionally be out of use.

But I can minimize possible downtime by learning from other people's downtime reports on a bug tracker.