They generally send you an advance email. I just had to migrate our Jenkins server a week or two ago because of this. I received something like 15 days notice on that one.
But obviously if there's a hard failure, they aren't always going to be able to give you the amount of time you'd want. Generally speaking, you should have accounted for this situation ahead of time in your engineering plans. Amazon EC2 doesn't have anything like vMotion; it's just a bunch of Xen virts.
If you're using the GUI, the first time you try a shutdown, it will do a normal request, but then if you go back and try it again while the first request is still pending, you should see the option for doing a hard restart. Try that and give it some time. Sometimes it takes an hour or two to get through. Otherwise, Amazon's tech support can help you.
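If you'd rather script it than click around, here's a minimal sketch of the same idea using boto3 (the Python AWS SDK); the region, instance ID, and timings are hypothetical:

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical stuck instance

    # First attempt: a normal stop request, same as the console's "Stop".
    ec2.stop_instances(InstanceIds=[instance_id])
    time.sleep(600)  # give it a while; these can take an hour or two

    resp = ec2.describe_instances(InstanceIds=[instance_id])
    state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
    if state == "stopping":
        # Still wedged: ask for a hard stop, like the second attempt in the GUI.
        ec2.stop_instances(InstanceIds=[instance_id], Force=True)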
> Generally speaking, you should have accounted for this situation ahead of time in your engineering plans.
I believe this comes as a shock to most people the first time they receive this email; it was to us, at least. When we signed up with Amazon there was no guideline or advice saying "hey, in a year or two your hardware might fail or need to be replaced, have a migration plan ready".
Perhaps it was our naivety, but we just thought, hey, it's the cloud, what could go wrong?! Now, of course, we are battle hardened.
I had an instance fail (unresponsive, then with that same stop/start delay) on the same day in two consecutive years. I remember because it happened to be Valentine's Day, and I had to break out the laptop for a while to check on things. I always wondered if that was a completely random occurrence, or part of some maintenance schedule that randomly affected me twice.
The last place I worked had a policy of making zero changes in production Friday-Sunday or around holidays. It was one of their better practices.
I honestly thought most people knew that about EC2 as one of the core trade-offs or engineering decisions that allows the platform to be what it is, compared to a more traditional VPS provider.
That isn't a tradeoff of EC2: hardware fails... I've had servers from actual unmanaged server providers have random failures and need maintenance or need to be replaced as well. Hell: I've had my hosting provider tell me they are moving data centers before ;P. The real problem here is just that Amazon has made the idea of putting your servers in the cloud so easy that people who don't understand that the servers don't run on evil magic are able to use them.
EC2 used to terminate instances with no warning in many situations when it launched. It seems they've concluded most people didn't understand that, and avoid that whenever possible now.
But "cloud" compute services should in general be treated as less reliable per individual unit unless your provider explicitly explain to you why not (such as guaranteeing to use a high-availability distributed filesystem), as you no direct way of ascertaining status of the underlying hardware. You need to plan for failure regardless.
One of the reasons running Jenkins on EC2 sucks for developers :( The data is stored on the machine, and there's a big risk of losing all your CI/CD infrastructure.
Have you given any thought to moving to something like https://circleci.com? [disclosure: I work there]
Even if it were a server, you'd have to protect yourself against the exact same risks: hardware may fail, the datacenter may burn, your data may be destroyed by cosmic rays.
Cloud platforms let you avoid physically dealing with the hardware, conveniently using ec2-create-snapshot instead of shuffling tapes back and forth, but the paradigm is exactly the same.
If you care about your data and your servers, you have to plan for failure. Cloud or not.
Under Xen, your instance is not quite a process in the conventional sense. The Xen hypervisor lives underneath all the OSes on the host, both those allocated to customers (domUs in Xen-speak) and the one allocated to manage the customers (dom0). Xen starts up before the dom0 kernel, and then loads the dom0 OS as a privileged instance.
This is quite different from KVM, where the hypervisor is built as kernel modules in the linux kernel, and the host linux OS acts as the management instance.
Functionally, however, the results of this setup are similar to being a process: the hypervisor may schedule your instance's CPU time, and your kernel is specialized so that when it needs memory, it calls into Xen for the appropriate mapping (similar to how a conventional kernel manages virtual memory for a process).
It's also important to note that Xen existed before, and works without, hardware virtualization support. That is one of the main reasons for its approach: there was no commodity support for virtualization within the CPU, so the only safe place to handle it was very low level within the kernel.
Given the relatively low cost of excess PC hardware these days it is extremely helpful to install one with a Xen, HyperV, or some other hypervisor type system and run multiple instances on it. By doing so you will get a much better feel for what is going on when you "start", "stop", "buy" etc an EC2 instance or 'Droplet' or VPS etc.
OK, the key to working with AWS EC2 instances is to remember that they are ephemeral and can disappear at any point in time. If you're treating it like a traditional server that you have in a rack, you're doing it wrong. Just turn it off and start a new one. You are using a configuration manager (Puppet, Chef, etc.), aren't you?
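For instance, a minimal sketch of "just start a new one" with boto3; the AMI, key name, security group, and bootstrap URL are all hypothetical stand-ins for whatever your config manager expects:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # First-boot script that hands off to the config manager (Puppet, Chef, ...).
    user_data = "#!/bin/bash\ncurl -sSL https://config.example.com/bootstrap.sh | bash\n"

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # hypothetical base AMI
        InstanceType="m3.medium",
        MinCount=1,
        MaxCount=1,
        KeyName="my-key",                       # hypothetical
        SecurityGroupIds=["sg-0123456789abcdef0"],
        UserData=user_data,
    )
    print("replacement instance:", resp["Instances"][0]["InstanceId"])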
I've learned a long time ago to treat traditional servers in a rack like they can disappear (or get compromised) at any time for a huge range of reasons. You can never be too paranoid.
Well, sort of. As long as you're storing all your data on EBS volumes, you can treat EC2 instances as machines in a rack. Problem with your instance? Reboot, and you'll be good as new.
Now, if you lose an EBS volume, that's totally different. You are snapshotting your EBS volumes, correct?
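As a sketch of that habit (boto3; the instance ID is hypothetical), something like this in a nightly cron job snapshots every volume attached to an instance, so there's always something to restore from:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical

    # Find every EBS volume currently attached to the instance.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]

    for vol in volumes:
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description="nightly backup of %s" % instance_id,
        )
        print("started", snap["SnapshotId"], "for", vol["VolumeId"])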
I think I'm missing something. Why isn't Amazon sorting this out behind the scenes so that any failing hardware is seamlessly replaced and the user is none the wiser? Am I expecting too much?
EC2 instances don't come with vmotion. It's up to the customer to detect a failed/retired node and restart on another EC2 instance.
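A minimal sketch of that detection with boto3: list instances whose status checks are failing or that have scheduled events (instance retirement and maintenance show up here through the API as well as in the emails).

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    statuses = ec2.describe_instance_status(IncludeAllInstances=True)["InstanceStatuses"]
    for s in statuses:
        events = s.get("Events", [])
        if s["InstanceStatus"]["Status"] == "impaired" or events:
            # e.g. event codes like "instance-retirement" or "system-maintenance"
            print(s["InstanceId"], s["InstanceStatus"]["Status"],
                  [e["Code"] for e in events])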
The first thing you discover when reading through the various options is that you need to treat ALL local storage like /tmp, subject to deletion at will. Keep your persistent storage on EBS/S3.
And even if you do keep your important stuff on EBS, make sure you take snapshots on a frequent basis. We have received this email a couple of times:
Your volume experienced a failure due to multiple failures of the
underlying hardware components and we were unable to recover it.
Although EBS volumes are designed for reliability, backed by multiple
physical drives, we are still exposed to durability risks caused by
concurrent hardware failures of multiple components, before our systems
are able to restore the redundancy. We publish our durability expectations
on the EBS detail page here (http://aws.amazon.com/ebs).
Sincerely,
EBS Support
Fortunately, we had recent snapshots and it was a matter of (manually) spinning up a new instance from those.
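For reference, a minimal sketch of that manual recovery with boto3 (the volume ID, instance ID, and availability zone are hypothetical): take the most recent snapshot, build a fresh volume from it, and attach it to the replacement instance.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Most recent snapshot we own for the lost volume.
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": ["vol-0123456789abcdef0"]}],
    )["Snapshots"]
    latest = max(snaps, key=lambda s: s["StartTime"])

    # New volume in the same AZ as the replacement instance, then attach.
    vol = ec2.create_volume(SnapshotId=latest["SnapshotId"],
                            AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",  # the new instance
                      Device="/dev/sdf")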
Windows Azure actually does this. If the host your virtual machine is on for some reason fails or needs to be replaced your entire VM is migrated to another host. The migration process can take a few minutes but all your data is safe.
My point being: on this topic, AWS could learn from Microsoft about how to do cloud.
The one problem they have is that the majority of their instances include local storage, which would make migration impossible. So the best they can offer is a reboot so the server ends up on another host.
They could potentially do this on their second generation (M3) instances, as well as micro instances if they wanted to. However I'd guess that these instances are just a small percentage of the overall servers used.
> The one problem they have is that the majority of their instances include local storage, which would make migration impossible
True for AWS. With VMware vSphere this could be done as a shared-nothing migration, which moves compute and storage together (vMotion + Storage vMotion combined).
While Xen should make live migrations technically possible, it would probably reduce EC2's provisioning flexibility and introduce undesirable complexity.
Migrations would be restricted to hosts running specific releases of the hypervisor [1], and AWS's SDN systems would need to handle these changes in very-near-realtime.
I'm working with another team of people who haven't yet tried working with cloud servers, and one of the things they're struggling with the most is that cloud servers need to be thought of as disposable. They can't easily digest the idea that servers can and will go down randomly for no known reason.
I think Amazon needs to put a lot more effort into educating people about the best practices involved here: creating immutable and disposable servers, making it easier (via console access) to create availability groups, etc.
> They can't easily digest the idea that servers can and will go down randomly for no known reason.
Then you should educate them. This isn't something unique to the cloud, physical servers absolutely can do this too. I work with thousands of (physical) servers in the day job, we have all kinds of failures that take out individual hosts on a regular basis.
The problem for a lot of people is that on a small scale physical hosts can appear to be extremely stable.
With a few dozen servers total, I have servers at work that have not had a failure in 8+ years, and we have some hardware that is 12+ years, and until office and data centre moves recently we had hardware that had not been rebooted for 5 years.
We have moved everything to VMs that we take hourly copies of, and can redeploy most of our VMs in minutes because we do know we need to be prepared for hardware failures, and occasionally face them, but they are rare events at our scale.
People with even smaller setups, with only a handful of servers, can easily go years without any failures. Then it's easy to get complacent.
To work in AWS's system you must have redundant nodes -- such that any single node can be rebooted without affecting the system as a whole.
Notification that your system is on old hardware that has been deprecated is part of the price of doing business in this cloud system.
As others have noted: yes, it is a little tense (is this my production database or my Continuous Integration machine?) -- the email you get just gives you an AWS instance ID, so you must look it up.
But AWS has enough components to help you build resilient systems that, if you've done your job correctly, you shouldn't care about these messages other than the labor of spinning up a replacement.
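The lookup itself is trivial to script; a minimal sketch with boto3 (hypothetical instance ID) that prints the Name tag so you can tell the production database from the CI box:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
    inst = resp["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
    print(inst["InstanceId"], inst["InstanceType"], tags.get("Name", "<untagged>"))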
I've gotten emails a week in advance and again a day in advance when an instance needed maintenance that would result in a 10-second network reset, so it'd really surprise me if Amazon completely retired an instance with no notification. This person must have missed the email, or it got caught in a spam filter.
One or more of your Amazon EC2 instances have been scheduled for
maintenance. The maintenance will result in a reset of the network
connection for your instance(s). The network reset will cause all
current connections to your instance(s) to be dropped. The network
reset will take less than 1 second to complete. Once complete your
instance(s) network connectivity will be restored. The instance(s)
will have their network connections reset during the time window
listed below.
You can avoid having your network connection reset at the specified
time by rebooting your instance(s) prior to the maintenance window. To
manage an instance reboot yourself you can issue an EC2 instance
reboot command. This can be done easily from the AWS Management
Console at http://console.aws.amazon.com by first selecting your
instance and then choosing ‘Reboot’ from the ‘Instance Actions’ drop
down box.
It looks like the forum thread has been updated with the following comment from Luke@AWS:
> I just looked into your instance a bit further, it does appear this was due to an issue with the underlying host on which your instance resided upon when you encountered the issues today.
> I do want to clarify here as our original reply mentioned a scheduled retirement, this was not the case and no notice was sent out because of this.
It looks like the original email was incorrect.
To add to this, though: if people rely on advance email notifications with AWS, then they are putting their availability at risk. Just because it is in the cloud doesn't insulate you from hardware issues; these need to be planned for. AWS does provide some building blocks to address this (auto-scaling and load balancers come to mind).
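A minimal sketch of those building blocks with boto3 (names, AMI, AZs, and the ELB are hypothetical): a launch configuration plus an auto-scaling group attached to a load balancer, so a retired instance gets replaced automatically.

    import boto3

    asg = boto3.client("autoscaling", region_name="us-east-1")

    asg.create_launch_configuration(
        LaunchConfigurationName="web-lc",
        ImageId="ami-0123456789abcdef0",   # hypothetical baked AMI
        InstanceType="m3.medium",
    )
    asg.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=2,
        MaxSize=4,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=["web-elb"],     # hypothetical classic ELB
        HealthCheckType="ELB",             # replace instances the ELB marks unhealthy
        HealthCheckGracePeriod=300,
    )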
It was an EBS-backed instance, and since I had a snapshot, it did not take much time to recreate a new instance. Besides that, I had another instance behind an ELB to avoid downtime.
It is fair to expect some notification for a scheduled retirement.
This case was about an EBS-backed instance, which is shared storage. Either way, you could avoid centrally shared storage by migrating the backing store first, then the running instance.
In any case, I'm not really trying to criticise here. Just pointing out an engineering trade-off.
Live migration doesn't have to require this. VMware vSphere for example includes storage vMotion capabilities which remove the need for shared storage.
How does vMotion pull this off? When I hear the phrase "live migration", my assumption is that the instance is serving traffic during the migration. If the instance is using local disk, then I would expect that there must be some shared state in the system, or alternately a brief outage. The latter would not be a truly live migration IMO.
Very few things would qualify as a "truly live migration" under those criteria. The only systems I can think of which would count are those which sync cpu operations across different hosts.
I don't know precisely how vmotion does it, but doing a live disc migration is basically:
- copy a snapshot of the disc image across
- pause IO in the vm
- sync any writes that have happened since taking the snapshot
- reconnect IO to the new remote
- unpause IO in the vm
Obviously you want the delay between the pause and unpause to be as short as possible, and there are many tricks to achieving that, but this hits all the fundamentals.
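As a toy sketch of those steps in Python (the vm object and its pause_io / dirty_blocks_since_copy / reattach_disk / resume_io hooks are made up; a real hypervisor tracks dirty blocks in its block layer rather than rescanning):

    import shutil

    def live_migrate_disk(vm, src_path, dst_path):
        # 1. Copy a snapshot of the disc image across while the VM keeps running.
        shutil.copyfile(src_path, dst_path)

        # 2. Pause IO in the VM so no new writes land on the source.
        vm.pause_io()
        try:
            # 3. Sync any blocks written since the copy started.
            with open(dst_path, "r+b") as dst:
                for offset, data in vm.dirty_blocks_since_copy():
                    dst.seek(offset)
                    dst.write(data)
            # 4. Reconnect the VM's block device to the new remote image.
            vm.reattach_disk(dst_path)
        finally:
            # 5. Unpause IO; the guest saw a pause, not an error.
            vm.resume_io()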
Agreed re: your steps. My point is just that this doesn't sound "live" to me, for non-marketing definitions of the word "live".
Looking at VMware's marketing literature [1], they claim "less than two seconds on a gigabit Ethernet network." But it sounds like that's just for the memory / cpu migration. The disk migration section of their literature doesn't have any readily-visible timing claims.
My experience with zero-downtime upgrades has always involved either bringing new stateless servers online that talk to shared storage, or adding storage nodes to an existing cluster. In both cases, this involves multiple VMs and shared state.
What does the downtime typically look like for vMotion storage migration? Do they do anything intelligent to allow checkpointing and then fast replay of just the deltas during the outage, or does "migration" really just mean "copy"? And if the former, do they impose any filesystem requirements?
> Agreed re: your steps. My point is just that this doesn't sound "live" to me, for non-marketing definitions of the word "live".
It's "live" in the sense that the guest doesn't see IO failure, or need to reboot. It might well see a pause. The more writes you're doing, the more sync you'll need to do. You might also be able to cheat a little here by pausing the guest as well, so it doesn't see any IO interruption at all, but that might not be acceptable from outside the guest. YMMV.
There are fundamental bandwidth limits at play here, so any solution to this problem is, to a certain extent, shuffling deckchairs.
> Looking at VMware's marketing literature [1], they claim "less than two seconds on a gigabit Ethernet network." But it sounds like that's just for the memory / cpu migration. The disk migration section of their literature doesn't have any readily-visible timing claims.
> My experience with zero-downtime upgrades has always involved either bringing new stateless servers online that talk to shared storage, or adding storage nodes to an existing cluster. In both cases, this involves multiple VMs and shared state.
You can balloon memory, hot-add CPUs, or resize discs upwards, with a single VM. Of course there are limits to this. If you're solving an Amazon-sized problem, they might well be important. If you can't (or don't want to) rebuild your app to fit into an Amazon-shaped hole, inflating a single machine might well be enough.
> What does the downtime typically look like for vMotion storage migration?
I don't have first-hand knowledge of storage vmotion, so I'm going off the same marketing materials you are. I have worked on a different storage migration system, though, so I am kinda familiar with the problems involved.
> Do they do anything intelligent to allow checkpointing and then fast replay of just the deltas during the outage, or does "migration" really just mean "copy"? And if the former, do they impose any filesystem requirements?
It's basically the former, although you don't actually need to checkpoint. From the Storage vMotion page:
Prior to vSphere 5.0, Storage vMotion used a mechanism called Change
Block Tracking (CBT). This method used iterative copy passes to first
copy all blocks in a VMDK to the destination datastore, then used the
changed block tracking map to copy blocks that were modified on the
source during the previous copy pass to the destination.
It sounds like 5.0-onwards is a slight simplification of this (single-pass, and presumably a live dirty-block queue), but it's not clear from either description how they stop the VM from writing faster than the migration can sync. If you're doing a multi-pass sync, you can block all IO on the final pass. That's kinda drastic, so you'd want that to be as short as possible - and again, pausing the guest so it literally can't see the IO pause might be acceptable here.

Alternatively you can increase the block device's latency as the dirty block count increases, to give the storage layer a chance to catch up. Guests slow down the same amount on average, but see a more gradual IO degradation rather than dropping off a cliff.
I can't imagine they'd want to impose filesystem requirements - it's much simpler if you just assume you're just looking at a uniform array of blocks than if you have to care about structure.
Live migration requires the VM to be alive. It doesn't help when a hardware failure takes out the whole machine, so people would still need to plan for that.
Yes, and that's the direction Amazon want you to be thinking in. They could throw engineering effort at reducing the likelihood of instances going away in this sort of scenario, but they've chosen not to.
Live migration is useful, but in my opinion it's kind of a band-aid solution for the problem of availability. Live migration is not necessary if you design your application as a distributed system.
This is somewhat unrelated, but what's the general consensus on the security of EC2 for very sensitive computation?
For example, I have a client who has some algorithms and data that are potentially quite valuable. EC2 and other AWS services would be a huge help with their project, but is there any way measures could be taken to ensure that no one - even Amazon employees - can get to their code and data?
Edit: devicenull makes some good points - I guess I had the CIA's $600 million AWS contract in my head when asking my question.
There's no need to wonder about these things. Check out the AWS Security Center at http://aws.amazon.com/security/ to get the facts. At that address you will find a very detailed (39 page) Security White Paper.
No. You don't control the execution environment, so if it's really that valuable it can't be trusted.
After all, you cannot stop someone from taking a full snapshot of the VM and grabbing all the information. Encryption is no help here, as the VM ultimately needs to store the key in memory.
If it's really that valuable (lots of companies seem to overestimate how much people would want to steal their data), then it really should never leave hardware under their control.
I have never heard anyone complain about a company taking infosec too seriously, let alone lots of companies.
Dude, my bank/email-host/health-insurer is teh suk. They overestimate the value of data confidentiality. I hope this does not become a new trend. I expect the companies that I deal with to play fast and loose with the data they control. Encrypting Data at rest? C'mon bro, if the data is so important why is it just sitting there with nobody using it.
Users complain all the time about being required to change their password every week to something unmemorable because of crazy complexity requirements.
That's all about regulatory risk, SOX, HIPAA, GLBA, etc. Let's be honest it is a "complaint" about a password policy, at best a means to an end. Unless you read that as a complaint about the motivation, because I did not.
I can't stand the "Security is a tradeoff with usability" line. It is not. When you lock the airplane lavatory door and the light turns on, what is the tradeoff? As far as I am concerned, Acme Bank's website is unusable if anyone can log in as me. How usable are your funds if anyone can transfer them out of your control?
War story: I was once called in to scale an application that had been running on AWS for 6 or 7 months and was failing due to excessive traffic. Normally a good problem to have, but this turned into a difficult problem because the application stored critical data on an EBS volume, and those are, of course, not sharable between instances. The only solution was to move to increasingly larger instances until the application could be rewritten.
Moral: If you are on the "cloud", make sure your application design fits your infrastructure.
Once upon a time there was EC2, without EBS. It was actually a pretty good place to be. There was no ambiguity because everyone who used EC2 was given a lot of warnings about how they'd have to architect their systems to avoid critical failure. I wonder if the introduction of EBS has actually increased data loss because people aren't as paranoid about it.
What's the point of this entry?
Are we surprised that hardware fails?
I am the complete opposite of an EC2 fanboy, but every time they decided to shut down a machine they had the good taste to send us an email first.