[Disclaimer: I work at prgmr.com]

You can install a custom OS, but right now it can be difficult to use an installer we don't provide, because we only allow serial console access, not VNC. This means most installers won't work out of the box. Worst case, you can dd an image to the disk over ssh from the rescue image.

FYI we don't do overage charges right now. For network, if we can't throttle your traffic down then we will shut your service off.

Our blog is a little misleading these days: for downtime affecting individual servers, we started emailing customers directly rather than posting to the blog. We do this because we want to make sure customers see the downtime notice. We also sometimes got confused responses to blog posts from people wondering whether a given service was affected or not; if we email directly there is no such confusion.

I think our worst-case downtime this year, barring about 5 services, has been the following:

* 0.75 hour network outage, unplanned - 2016-03-16 (gave proportional credit)

* ~2.5 hours unplanned downtime due to hardware failure requiring new components - 2016-04-03 (gave 15% month credit)

* 2.6 hours downtime from start of maintenance window, planned due to security upgrade - 2016-07-23 (gave proportional credit)

* 2 hours or less downtime, planned due to security upgrade - sometime around 2016-09-01 (gave proportional credit)

* 1.5 hour network outage, unplanned - 2016-09-09 (gave proportional credit)

* 1.3 hour network outage, unplanned - 2016-11-06 (gave proportional credit)

* 2.04 hours downtime from start of maintenance window, planned due to security upgrade - 2016-11-18 (gave proportional credit)

This is a total of up to 12.69 hours of downtime over the year so far, assuming downtime started at the beginning of each maintenance window (it usually started after). Of that, 6.05 hours, or less than half, was unplanned.

So far this year there have been about 336 days, or 8064 hours. 12.69/8064 is about 0.16% downtime, which works out to 99.84% uptime overall. That is significantly lower than we would like. For some servers the uptime has so far been significantly better, in that there were no hardware failures, one of the security upgrades was unnecessary, and the turnaround time for the remainder of the security upgrades was much faster than for this particular server.
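For anyone who wants to check the arithmetic, here is a small Python sketch that reproduces the totals from the list above. The durations and dates are taken from the bullets (using 2.5 for the "~2.5" entry and 2.0 for the "2 hours or less" entry); the variable names are just for illustration.

```python
# (date, hours of downtime, planned?) for each incident in the list above.
incidents = [
    ("2016-03-16", 0.75, False),  # network outage
    ("2016-04-03", 2.5, False),   # hardware failure (approximate)
    ("2016-07-23", 2.6, True),    # security upgrade
    ("2016-09-01", 2.0, True),    # security upgrade (upper bound, date approximate)
    ("2016-09-09", 1.5, False),   # network outage
    ("2016-11-06", 1.3, False),   # network outage
    ("2016-11-18", 2.04, True),   # security upgrade
]

total_hours = sum(hours for _, hours, _ in incidents)
unplanned_hours = sum(hours for _, hours, planned in incidents if not planned)

elapsed_hours = 336 * 24  # about 336 days so far this year
downtime_fraction = total_hours / elapsed_hours
uptime_pct = 100 * (1 - downtime_fraction)

print(f"total downtime:     {total_hours:.2f} h")
print(f"unplanned downtime: {unplanned_hours:.2f} h")
print(f"uptime:             {uptime_pct:.2f} %")
```

Note that the uptime figure is one minus the downtime fraction, not the ratio itself.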

For this particular server, the largest downtime contributors were security upgrades and network outages, in that order. For network downtime, we got around to setting up our second upstream, but there are a number of single points of failure we should take care of in 2017. There is also some additional scripting we should probably do that would cut down on the network downtime a lot, such as automatically taking down BGP if connectivity beyond the first hop is lost.
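To make the BGP idea a bit more concrete, here is a rough Python sketch of the decision logic only. It is not a script we run: `probe` and `withdraw` are hypothetical callables you would have to supply (for example a ping wrapper and whatever command shuts down the session on your router), the target addresses are placeholders from the documentation range, and the threshold of three consecutive failures is an arbitrary assumption.

```python
from typing import Callable, Sequence


class BgpWatchdog:
    """Withdraw BGP once targets beyond the first hop stay unreachable.

    probe(addr) -> bool and withdraw() are hypothetical hooks supplied by
    the caller; this class only decides *when* to call withdraw().
    """

    def __init__(
        self,
        probe: Callable[[str], bool],
        withdraw: Callable[[], None],
        targets: Sequence[str],
        failures_needed: int = 3,  # assumed threshold, tune to taste
    ) -> None:
        self.probe = probe
        self.withdraw = withdraw
        self.targets = list(targets)
        self.failures_needed = failures_needed
        self.consecutive_failures = 0
        self.withdrawn = False

    def tick(self) -> bool:
        """Run one check; return True if BGP is (now) withdrawn."""
        # Count the check as failed only if *no* target answers, so a
        # single flaky target does not take the session down.
        if any(self.probe(addr) for addr in self.targets):
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

        if self.consecutive_failures >= self.failures_needed and not self.withdrawn:
            self.withdraw()
            self.withdrawn = True
        return self.withdrawn


if __name__ == "__main__":
    # Demo with a fake probe that always fails; addresses are placeholders.
    dog = BgpWatchdog(
        probe=lambda addr: False,
        withdraw=lambda: print("would take BGP session down here"),
        targets=["192.0.2.1", "192.0.2.2"],
    )
    for _ in range(3):
        dog.tick()
```

Requiring several consecutive failures across more than one target is meant to avoid withdrawing routes over a single dropped packet. Bringing the session back up is deliberately left out of the sketch.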

For the security update downtime, I think our most realistic bet right now is to get ourselves on the latest version of Xen once it comes out. That will hopefully have a stable implementation (not a technology preview) for live patching.