They generally send you an advance email. I just had to migrate our Jenkins server a week or two ago because of this. I received something like 15 days notice on that one.
But obviously if there's a hard failure, they aren't always going to be able to give you the amount of time you'd want. Generally speaking, you should have accounted for this situation ahead of time in your engineering plans. Amazon EC2 doesn't have anything like vMotion; it's just a bunch of Xen virts.
If you're using the GUI, the first time you try a shutdown, it will do a normal request, but then if you go back and try it again while the first request is still pending, you should see the option for doing a hard restart. Try that and give it some time. Sometimes it takes an hour or two to get through. Otherwise, Amazon's tech support can help you.
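If you'd rather script it than click around, here's a minimal sketch of the same idea using boto3 (the Python AWS SDK); the region, instance ID, and timings are hypothetical:

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical stuck instance

    # First attempt: a normal stop request, same as the console's "Stop".
    ec2.stop_instances(InstanceIds=[instance_id])
    time.sleep(600)  # give it a while; these can take an hour or two

    resp = ec2.describe_instances(InstanceIds=[instance_id])
    state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
    if state == "stopping":
        # Still wedged: ask for a hard stop, like the second attempt in the GUI.
        ec2.stop_instances(InstanceIds=[instance_id], Force=True)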
> Generally speaking, you should have accounted for this situation ahead of time in your engineering plans.
I believe this comes as a shock to most people the first time they receive this email; it was to us, at least. When we signed up with Amazon there was no guideline or advice saying "hey, in a year or two your hardware might fail or need to be replaced, have a migration plan ready".
Perhaps it was our naivety, but we just thought, hey, it's the cloud, what could go wrong?! Now, of course, we are battle hardened.
I had an instance fail (unresponsive, then with that same stop/start delay) on the same day in two consecutive years. I remember because it happened to be Valentine's Day, and I had to break out the laptop for a while to check on things. I always wondered if that was a completely random occurrence, or part of some maintenance schedule that randomly affected me twice.
The last place I worked had a policy of making zero changes in production Friday-Sunday or around holidays. It was one of their better practices.
I honestly thought most people knew that about EC2 as one of the core trade-offs or engineering decisions that allows the platform to be what it is, compared to a more traditional VPS provider.
That isn't a tradeoff of EC2: hardware fails... I've had servers from actual unmanaged server providers have random failures and need maintenance or need to be replaced as well. Hell: I've had my hosting provider tell me they are moving data centers before ;P. The real problem here is just that Amazon has made the idea of putting your servers in the cloud so easy that people who don't understand that the servers don't run on evil magic are able to use them.
EC2 used to terminate instances with no warning in many situations when it launched. It seems they've concluded most people didn't understand that, and avoid that whenever possible now.
But "cloud" compute services should in general be treated as less reliable per individual unit unless your provider explicitly explain to you why not (such as guaranteeing to use a high-availability distributed filesystem), as you no direct way of ascertaining status of the underlying hardware. You need to plan for failure regardless.
One of the reasons running Jenkins on EC2 sucks for developers :( The data is stored on the machine, and there's a big risk of losing all your CI/CD infrastructure.
Have you given any thought to moving to something like https://circleci.com? [disclosure: I work there]
Even if it were a server, you'd have to protect yourself against the exact same risks: hardware may fail, the datacenter may burn, your data may be destroyed by cosmic rays.
Cloud platforms let you avoid physically dealing with the hardware, conveniently using ec2-create-snapshot instead of shuffling tapes back and forth, but the paradigm is exactly the same.
If you care about your data and your servers, you have to plan for failure. Cloud or not.
Under Xen, your instance is not quite a process in the conventional sense. The Xen hypervisor lives underneath all the OSes on the host, both those allocated to customers (domUs in Xen-speak) and the one allocated to manage the customers (dom0). Xen starts up before the dom0 kernel, and then loads the dom0 OS as a privileged instance.
This is quite different from KVM, where the hypervisor is built as kernel modules in the linux kernel, and the host linux OS acts as the management instance.
Functionally, however, the results of this setup are similar to being a process: the hypervisor may schedule your instance's CPU time, and your kernel is specialized so that when it needs memory, it calls into Xen for the appropriate mapping (similar to how a conventional kernel manages virtual memory for a process).
It's also important to note that Xen existed before, and works without, hardware virtualization support. That is one of the main reasons for its approach: there was no commodity support for virtualization within the CPU, so the only safe place to handle it was very low level within the kernel.
Given the relatively low cost of excess PC hardware these days it is extremely helpful to install one with a Xen, HyperV, or some other hypervisor type system and run multiple instances on it. By doing so you will get a much better feel for what is going on when you "start", "stop", "buy" etc an EC2 instance or 'Droplet' or VPS etc.
OK, the key to working with AWS EC2 instances is to remember that they are ephemeral and can disappear at any point in time. If you're treating it like a traditional server that you have in a rack, you're doing it wrong. Just turn it off and start a new one. You are using a configuration manager (Puppet, Chef, etc.), aren't you?
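For instance, a minimal sketch of "just start a new one" with boto3; the AMI, key name, security group, and bootstrap URL are all hypothetical stand-ins for whatever your config manager expects:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # First-boot script that hands off to the config manager (Puppet, Chef, ...).
    user_data = "#!/bin/bash\ncurl -sSL https://config.example.com/bootstrap.sh | bash\n"

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # hypothetical base AMI
        InstanceType="m3.medium",
        MinCount=1,
        MaxCount=1,
        KeyName="my-key",                       # hypothetical
        SecurityGroupIds=["sg-0123456789abcdef0"],
        UserData=user_data,
    )
    print("replacement instance:", resp["Instances"][0]["InstanceId"])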
I've learned a long time ago to treat traditional servers in a rack like they can disappear (or get compromised) at any time for a huge range of reasons. You can never be too paranoid.
Well, sort of. As long as you're storing all your data on EBS volumes, you can treat EC2 instances as machines in a rack. Problem with your instance? Reboot, and you'll be good as new.
Now, if you lose an EBS volume, that's totally different. You are snapshotting your EBS volumes, correct?
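As a sketch of that habit (boto3; the instance ID is hypothetical), something like this in a nightly cron job snapshots every volume attached to an instance, so there's always something to restore from:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # hypothetical

    # Find every EBS volume currently attached to the instance.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]

    for vol in volumes:
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description="nightly backup of %s" % instance_id,
        )
        print("started", snap["SnapshotId"], "for", vol["VolumeId"])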
I think I'm missing something. Why isn't Amazon sorting this out behind the scenes so that any failing hardware is seamlessly replaced and the user is none the wiser? Am I expecting too much?
EC2 instances don't come with vmotion. It's up to the customer to detect a failed/retired node and restart on another EC2 instance.
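A minimal sketch of that detection with boto3: list instances whose status checks are failing or that have scheduled events (instance retirement and maintenance show up here through the API as well as in the emails).

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    statuses = ec2.describe_instance_status(IncludeAllInstances=True)["InstanceStatuses"]
    for s in statuses:
        events = s.get("Events", [])
        if s["InstanceStatus"]["Status"] == "impaired" or events:
            # e.g. event codes like "instance-retirement" or "system-maintenance"
            print(s["InstanceId"], s["InstanceStatus"]["Status"],
                  [e["Code"] for e in events])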
The first thing you discover when reading through the various options is that you need to treat ALL local storage like /tmp, subject to deletion at will. Keep your persistent storage on EBS/S3.
And even if you do keep your important stuff on EBS, make sure you take snapshots on a frequent basis. We have received this email a couple of times:
Your volume experienced a failure due to multiple failures of the
underlying hardware components and we were unable to recover it.
Although EBS volumes are designed for reliability, backed by multiple
physical drives, we are still exposed to durability risks caused by
concurrent hardware failures of multiple components, before our systems
are able to restore the redundancy. We publish our durability expectations
on the EBS detail page here (http://aws.amazon.com/ebs).
Sincerely,
EBS Support
Fortunately, we had recent snapshots and it was a matter of (manually) spinning up a new instance from those.
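For reference, a minimal sketch of that manual recovery with boto3 (the volume ID, instance ID, and availability zone are hypothetical): take the most recent snapshot, build a fresh volume from it, and attach it to the replacement instance.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Most recent snapshot we own for the lost volume.
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": ["vol-0123456789abcdef0"]}],
    )["Snapshots"]
    latest = max(snaps, key=lambda s: s["StartTime"])

    # New volume in the same AZ as the replacement instance, then attach.
    vol = ec2.create_volume(SnapshotId=latest["SnapshotId"],
                            AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",  # the new instance
                      Device="/dev/sdf")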
Windows Azure actually does this. If the host your virtual machine is on for some reason fails or needs to be replaced your entire VM is migrated to another host. The migration process can take a few minutes but all your data is safe.
My point being: on this topic, AWS could learn from Microsoft about how to do cloud.
The one problem they have is that the majority of their instances include local storage, which would make migration impossible. So the best they can offer is a reboot so the server ends up on another host.
They could potentially do this on their second generation (M3) instances, as well as micro instances if they wanted to. However I'd guess that these instances are just a small percentage of the overall servers used.
> The one problem they have is that the majority of their instances include local storage, which would make migration impossible
True for AWS. With VMware vSphere this could be done as a shared-nothing migration, which moves compute and storage together (vMotion + Storage vMotion combined).
While Xen should make live migrations technically possible, it would probably reduce EC2's provisioning flexibility and introduce undesirable complexity.
Migrations would be restricted to hosts running specific releases of the hypervisor [1], and AWS's SDN systems would need to handle these changes in very-near-realtime.
I'm working with another team of people who haven't yet tried working with cloud servers, and one of the things they're struggling with the most is that cloud servers need to be thought of as disposable. They can't easily digest the idea that servers can and will go down randomly for no known reason.
I think Amazon needs to put a lot more effort into educating people about the best practices involved here: creating immutable and disposable servers, making it easier (via console access) to create availability groups, etc.
> They can't easily digest the idea that servers can and will go down randomly for no known reason.
Then you should educate them. This isn't something unique to the cloud, physical servers absolutely can do this too. I work with thousands of (physical) servers in the day job, we have all kinds of failures that take out individual hosts on a regular basis.
The problem for a lot of people is that on a small scale physical hosts can appear to be extremely stable.
With a few dozen servers total, I have servers at work that have not had a failure in 8+ years, and we have some hardware that is 12+ years, and until office and data centre moves recently we had hardware that had not been rebooted for 5 years.
We have moved everything to VMs that we take hourly copies of, and can redeploy most of our VMs in minutes because we do know we need to be prepared for hardware failures, and occasionally face them, but they are rare events at our scale.
People with even smaller setups, with only a handful of servers, can easily go years without any failures. Then it's easy to get complacent.
To work in AWS's system you must have redundant nodes -- such that any single node can be rebooted without affecting the system as a whole.
Notification that your system is on old hardware that has been deprecated is part of the price of doing business in this cloud system.
As others have noted: yes, it is a little tense (is this my production database or my Continuous Integration machine?) -- the email you get just gives you an AWS instance ID, so you must look it up.
But AWS has enough components to help you build resilient systems that, if you've done your job correctly, you shouldn't care about these messages other than the labor of spinning up a replacement.
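The lookup itself is trivial to script; a minimal sketch with boto3 (hypothetical instance ID) that prints the Name tag so you can tell the production database from the CI box:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
    inst = resp["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
    print(inst["InstanceId"], inst["InstanceType"], tags.get("Name", "<untagged>"))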
I've gotten emails a week in advance and again a day in advance when an instance needed maintenance that would result in a 10-second network reset, so it'd really surprise me if Amazon completely retired an instance with no notification. This person must have missed the email, or it got caught in a spam filter.
One or more of your Amazon EC2 instances have been scheduled for
maintenance. The maintenance will result in a reset of the network
connection for your instance(s). The network reset will cause all
current connections to your instance(s) to be dropped. The network
reset will take less than 1 second to complete. Once complete your
instance(s) network connectivity will be restored. The instance(s)
will have their network connections reset during the time window
listed below.
You can avoid having your network connection reset at the specified
time by rebooting your instance(s) prior to the maintenance window. To
manage an instance reboot yourself you can issue an EC2 instance
reboot command. This can be done easily from the AWS Management
Console at http://console.aws.amazon.com by first selecting your
instance and then choosing ‘Reboot’ from the ‘Instance Actions’ drop
down box.
It looks like the forum thread has been updated with the following comment from Luke@AWS:
> I just looked into your instance a bit further, it does appear this was due to an issue with the underlying host on which your instance resided upon when you encountered the issues today.
> I do want to clarify here as our original reply mentioned a scheduled retirement, this was not the case and no notice was sent out because of this.
It looks like the original email was incorrect.
To add to this, though: if people rely on advance email notifications with AWS, then they are putting their availability at risk. Just because it is in the cloud doesn't insulate you from hardware issues; these need to be planned for. AWS does provide some building blocks to address this (auto-scaling and load balancers come to mind).
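A minimal sketch of those building blocks with boto3 (names, AMI, AZs, and the ELB are hypothetical): a launch configuration plus an auto-scaling group attached to a load balancer, so a retired instance gets replaced automatically.

    import boto3

    asg = boto3.client("autoscaling", region_name="us-east-1")

    asg.create_launch_configuration(
        LaunchConfigurationName="web-lc",
        ImageId="ami-0123456789abcdef0",   # hypothetical baked AMI
        InstanceType="m3.medium",
    )
    asg.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=2,
        MaxSize=4,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=["web-elb"],     # hypothetical classic ELB
        HealthCheckType="ELB",             # replace instances the ELB marks unhealthy
        HealthCheckGracePeriod=300,
    )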
It was an EBS-backed instance, and since I had a snapshot, it did not take much time to recreate a new instance. Besides that, I had another instance behind an ELB to avoid downtime.
It is fair to expect some notification for a scheduled retirement.
This case was about an EBS-backed instance, which is shared storage. Either way, you could avoid centrally shared storage by migrating the backing store first, then the running instance.
In any case, I'm not really trying to criticise here. Just pointing out an engineering trade-off.
Live migration doesn't have to require this. VMware vSphere for example includes storage vMotion capabilities which remove the need for shared storage.
How does vMotion pull this off? When I hear the phrase "live migration", my assumption is that the instance is serving traffic during the migration. If the instance is using local disk, then I would expect that there must be some shared state in the system, or alternately a brief outage. The latter would not be a truly live migration IMO.
Very few things would qualify as a "truly live migration" under those criteria. The only systems I can think of which would count are those which sync cpu operations across different hosts.
I don't know precisely how vmotion does it, but doing a live disc migration is basically:
- copy a snapshot of the disc image across
- pause IO in the vm
- sync any writes that have happened since taking the snapshot
- reconnect IO to the new remote
- unpause IO in the vm
Obviously you want the delay between the pause and unpause to be as short as possible, and there are many tricks to achieving that, but this hits all the fundamentals.
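As a toy sketch of those steps in Python (the vm object and its pause_io / dirty_blocks_since_copy / reattach_disk / resume_io hooks are made up; a real hypervisor tracks dirty blocks in its block layer rather than rescanning):

    import shutil

    def live_migrate_disk(vm, src_path, dst_path):
        # 1. Copy a snapshot of the disc image across while the VM keeps running.
        shutil.copyfile(src_path, dst_path)

        # 2. Pause IO in the VM so no new writes land on the source.
        vm.pause_io()
        try:
            # 3. Sync any blocks written since the copy started.
            with open(dst_path, "r+b") as dst:
                for offset, data in vm.dirty_blocks_since_copy():
                    dst.seek(offset)
                    dst.write(data)
            # 4. Reconnect the VM's block device to the new remote image.
            vm.reattach_disk(dst_path)
        finally:
            # 5. Unpause IO; the guest saw a pause, not an error.
            vm.resume_io()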
Agreed re: your steps. My point is just that this doesn't sound "live" to me, for non-marketing definitions of the word "live".
Looking at VMware's marketing literature [1], they claim "less than two seconds on a gigabit Ethernet network." But it sounds like that's just for the memory / cpu migration. The disk migration section of their literature doesn't have any readily-visible timing claims.
My experience with zero-downtime upgrades has always involved either bringing new stateless servers online that talk to shared storage, or adding storage nodes to an existing cluster. In both cases, this involves multiple VMs and shared state.
What does the downtime typically look like for vMotion storage migration? Do they do anything intelligent to allow checkpointing and then fast replay of just the deltas during the outage, or does "migration" really just mean "copy"? And if the former, do they impose any filesystem requirements?
> Agreed re: your steps. My point is just that this doesn't sound "live" to me, for non-marketing definitions of the word "live".
It's "live" in the sense that the guest doesn't see IO failure, or need to reboot. It might well see a pause. The more writes you're doing, the more sync you'll need to do. You might also be able to cheat a little here by pausing the guest as well, so it doesn't see any IO interruption at all, but that might not be acceptable from outside the guest. YMMV.
There are fundamental bandwidth limits at play here, so any solution to this problem is, to a certain extent, shuffling deckchairs.
> Looking at VMware's marketing literature [1], they claim "less than two seconds on a gigabit Ethernet network." But it sounds like that's just for the memory / cpu migration. The disk migration section of their literature doesn't have any readily-visible timing claims.
> My experience with zero-downtime upgrades has always involved either bringing new stateless servers online that talk to shared storage, or adding storage nodes to an existing cluster. In both cases, this involves multiple VMs and shared state.
You can balloon memory, hot-add CPUs, or resize discs upwards, with a single VM. Of course there are limits to this. If you're solving an Amazon-sized problem, they might well be important. If you can't (or don't want to) rebuild your app to fit into an Amazon-shaped hole, inflating a single machine might well be enough.
> What does the downtime typically look like for vMotion storage migration?
I don't have first-hand knowledge of storage vmotion, so I'm going off the same marketing materials you are. I have worked on a different storage migration system, though, so I am kinda familiar with the problems involved.
> Do they do anything intelligent to allow checkpointing and then fast replay of just the deltas during the outage, or does "migration" really just mean "copy"? And if the former, do they impose any filesystem requirements?
It's basically the former, although you don't actually need to checkpoint. From the Storage vMotion page:
Prior to vSphere 5.0, Storage vMotion used a mechanism called Change
Block Tracking (CBT). This method used iterative copy passes to first
copy all blocks in a VMDK to the destination datastore, then used the
changed block tracking map to copy blocks that were modified on the
source during the previous copy pass to the destination.
It sounds like 5.0-onwards is a slight simplification of this (single-pass, and presumably a live dirty-block queue), but it's not clear from either description how they stop the VM from writing faster than the migration can sync. If you're doing a multi-pass sync, you can block all IO on the final pass. That's kinda drastic, so you'd want that to be as short as possible - and again, pausing the guest so it literally can't see the IO pause might be acceptable here.

Alternatively you can increase the block device's latency as the dirty block count increases, to give the storage layer a chance to catch up. Guests slow down the same amount on average, but see a more gradual IO degradation rather than dropping off a cliff.
I can't imagine they'd want to impose filesystem requirements - it's much simpler if you just assume you're just looking at a uniform array of blocks than if you have to care about structure.
Live migration requires the VM to be alive. It doesn't help when a hardware failure takes out the whole machine, so people would still need to plan for that.
Yes, and that's the direction Amazon want you to be thinking in. They could throw engineering effort at reducing the likelihood of instances going away in this sort of scenario, but they've chosen not to.
Live migration is useful, but in my opinion it's kind of a band-aid solution for the problem of availability. Live migration is not necessary if you design your application as a distributed system.
This is somewhat unrelated, but what's the general consensus on the security of EC2 for very sensitive computation?
For example, I have a client who has some algorithms and data that are potentially quite valuable. EC2 and other AWS services would be a huge help with their project, but is there any way measures could be taken to ensure that no one - even Amazon employees - can get to their code and data?
Edit: devicenull makes some good points - I guess I had the CIA's $600 million AWS contract in my head when asking my question.
There's no need to wonder about these things. Check out the AWS Security Center at http://aws.amazon.com/security/ to get the facts. At that address you will find a very detailed (39 page) Security White Paper.
No. You don't control the execution environment, so if it's really that valuable it can't be trusted.
After all, you cannot stop someone from taking a full snapshot of the VM and grabbing all the information. Encryption is no help here, as the VM ultimately needs to store the key in memory.
If it's really that valuable (lots of companies seem to overestimate how much people would want to steal their data), then it really should never leave hardware under their control.
I have never heard anyone complain about a company taking infosec too seriously, let alone lots of companies.
Dude, my bank/email-host/health-insurer is teh suk. They overestimate the value of data confidentiality. I hope this does not become a new trend. I expect the companies that I deal with to play fast and loose with the data they control. Encrypting Data at rest? C'mon bro, if the data is so important why is it just sitting there with nobody using it.
Users complain all the time about being required to change their password every week to something unmemorable because of crazy complexity requirements.
That's all about regulatory risk, SOX, HIPAA, GLBA, etc. Let's be honest it is a "complaint" about a password policy, at best a means to an end. Unless you read that as a complaint about the motivation, because I did not.
I can't stand the "Security is a tradeoff with usability" line. It is not. When you lock the airplane lavatory door and the light turns on, what is the tradeoff? As far as I am concerned, Acme Bank's website is unusable if anyone can log in as me. How usable are your funds if anyone can transfer them out of your control?
War story: I was once called in to scale an application that had been running on AWS for 6 or 7 months and was failing due to excessive traffic. Normally a good problem to have, but this turned into a difficult problem because the application stored critical data on an EBS volume, and those are, of course, not sharable between instances. The only solution was to move to increasingly larger instances until the application could be rewritten.
Moral: If you are on the "cloud", make sure your application design fits your infrastructure.
Once upon a time there was EC2, without EBS. It was actually a pretty good place to be. There was no ambiguity because everyone who used EC2 was given a lot of warnings about how they'd have to architect their systems to avoid critical failure. I wonder if the introduction of EBS has actually increased data loss because people aren't as paranoid about it.
What's the point of this entry?
Are we surprised that hardware fails?
I am the complete opposite of an EC2 fanboy, but every time they decided to shut down a machine they had the good taste to send us an email first.