
How does vMotion pull this off? When I hear the phrase "live migration", my assumption is that the instance is serving traffic during the migration. If the instance is using local disk, then I would expect there to be either some shared state in the system or a brief outage. The latter would not be a truly live migration, IMO.



Very few things would qualify as a "truly live migration" under those criteria. The only systems I can think of that would count are those which sync CPU operations across different hosts.

I don't know precisely how vmotion does it, but doing a live disc migration is basically:

    - copy a snapshot of the disc image across
    - pause IO in the vm
    - sync any writes that have happened since taking the snapshot
    - reconnect IO to the new remote
    - unpause IO in the vm
Obviously you want the delay between the pause and unpause to be as short as possible, and there are many tricks to achieving that, but this hits all the fundamentals.
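Roughly, as a toy in Python (the list-of-blocks "discs", the dirty set and the pending-write queue are all made up for illustration, not any real hypervisor interface):

    def live_disc_migration(src_blocks, writes_during_copy):
        dst_blocks = [None] * len(src_blocks)
        dirty = set()

        # 1. Copy a snapshot of the image while the guest keeps writing.
        for i, block in enumerate(src_blocks):
            dst_blocks[i] = block
            # Simulate guest writes racing the copy; track which blocks change.
            for idx, data in writes_during_copy.pop(i, []):
                src_blocks[idx] = data
                dirty.add(idx)

        # 2. "Pause IO": from here on, no new guest writes are allowed.
        # 3. Sync only the blocks dirtied during the first pass.
        for idx in dirty:
            dst_blocks[idx] = src_blocks[idx]

        # 4./5. Reconnect IO to the destination copy and unpause.
        return dst_blocks

    src = ["a", "b", "c", "d"]
    # While block 1 is being copied, the guest overwrites blocks 0 and 3.
    pending = {1: [(0, "A"), (3, "D")]}
    assert live_disc_migration(src, pending) == ["A", "b", "c", "D"]
The pause only has to cover the blocks dirtied since the snapshot, not the whole image, which is why keeping it short is feasible at all.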


Agreed re: your steps. My point is just that this doesn't sound "live" to me, for non-marketing definitions of the word "live".

Looking at VMware's marketing literature [1], they claim "less than two seconds on a gigabit Ethernet network." But it sounds like that's just for the memory / cpu migration. The disk migration section of their literature doesn't have any readily-visible timing claims.

My experience with zero-downtime upgrades has always involved either bringing new stateless servers online that talk to shared storage, or adding storage nodes to an existing cluster. In both cases, this involves multiple VMs and shared state.

What does the downtime typically look like for vMotion storage migration? Do they do anything intelligent to allow checkpointing and then fast replay of just the deltas during the outage, or does "migration" really just mean "copy"? And if the former, do they impose any filesystem requirements?

[1] http://www.vmware.com/products/vsphere/features-vmotion


> Agreed re: your steps. My point is just that this doesn't sound "live" to me, for non-marketing definitions of the word "live".

It's "live" in the sense that the guest doesn't see IO failure, or need to reboot. It might well see a pause. The more writes you're doing, the more sync you'll need to do. You might also be able to cheat a little here by pausing the guest as well, so it doesn't see any IO interruption at all, but that might not be acceptable from outside the guest. YMMV.

There are fundamental bandwidth limits at play here, so any solution to this problem is, to a certain extent, shuffling deckchairs.

> Looking at VMware's marketing literature [1], they claim "less than two seconds on a gigabit Ethernet network." But it sounds like that's just for the memory / cpu migration. The disk migration section of their literature doesn't have any readily-visible timing claims.

http://www.vmware.com/products/vsphere/features-storage-vmot... claims "zero-downtime". I guess it depends how you define "downtime", really.

> My experience with zero-downtime upgrades has always involved either bringing new stateless servers online that talk to shared storage, or adding storage nodes to an existing cluster. In both cases, this involves multiple VMs and shared state.

You can balloon memory, hot-add CPUs, or grow discs, all on a single VM. Of course there are limits to this. If you're solving an Amazon-sized problem, they might well be important. If you can't (or don't want to) rebuild your app to fit into an Amazon-shaped hole, inflating a single machine might well be enough.
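
For a sense of what that looks like, here's roughly the equivalent with libvirt's Python bindings (KVM/QEMU rather than VMware; the domain name and sizes are made up, and each call assumes the guest was defined with enough headroom to grow into):

    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("example-vm")   # hypothetical guest

    # Balloon the guest up to 8 GiB of memory (value is in KiB).
    dom.setMemoryFlags(8 * 1024 * 1024, libvirt.VIR_DOMAIN_AFFECT_LIVE)

    # Hot-add vCPUs up to a total of 8, without a reboot.
    dom.setVcpusFlags(8, libvirt.VIR_DOMAIN_AFFECT_LIVE)

    # Grow the first virtio disc to 100 GiB while the guest runs; the
    # guest still has to grow its own partition/filesystem afterwards.
    dom.blockResize("vda", 100 * 1024**3, libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES)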

> What does the downtime typically look like for vMotion storage migration?

I don't have first-hand knowledge of storage vmotion, so I'm going off the same marketing materials you are. I have worked on a different storage migration system, though, so I am kinda familiar with the problems involved.

> Do they do anything intelligent to allow checkpointing and then fast replay of just the deltas during the outage, or does "migration" really just mean "copy"? And if the former, do they impose any filesystem requirements?

It's basically the former, although you don't actually need to checkpoint. From the Storage vMotion page:

    Prior to vSphere 5.0, Storage vMotion used a mechanism called Change
    Block Tracking (CBT). This method used iterative copy passes to first
    copy all blocks in a VMDK to the destination datastore, then used the
    changed block tracking map to copy blocks that were modified on the
    source during the previous copy pass to the destination.
It sounds like 5.0-onwards is a slight simplification of this (single-pass, and presumably a live dirty-block queue), but it's not clear from either description how they stop the VM from writing faster than the migration can sync.

If you're doing a multi-pass sync, you can block all IO on the final pass. That's kinda drastic, so you'd want that to be as short as possible - and again, pausing the guest so it literally can't see the IO pause might be acceptable here. Alternatively you can increase the block device's latency as the dirty block count grows, to give the storage layer a chance to catch up. Guests slow down by the same amount on average, but see a gradual IO degradation rather than dropping off a cliff.
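
To make that concrete, here's a toy sketch of the iterative changed-block-tracking loop (not VMware's code, just the shape of the algorithm; "guest_io" stands in for writes landing while a pass runs):

    def storage_vmotion_ish(src, dst, dirty, guest_io, final_pass_limit=8):
        # Pass 1: copy every block while the guest keeps running.
        for i, block in enumerate(src):
            dst[i] = block
            guest_io(src, dirty)          # writes during the pass get tracked

        # Iterative passes: re-copy whatever was dirtied last time round.
        # If the guest writes faster than we copy, this never converges -
        # which is where throttling (or briefly pausing) guest IO comes in.
        while len(dirty) > final_pass_limit:
            last_round, dirty = dirty, set()
            for i in last_round:
                dst[i] = src[i]
                guest_io(src, dirty)      # new writes go into the next round

        # Final pass: guest IO is blocked, so the remaining deltas are
        # flushed and the VM's disc can be switched to the destination.
        for i in dirty:
            dst[i] = src[i]

    # Tiny demo: the "guest" keeps rewriting block 0 for a while, then idles.
    writes = iter(range(100))
    def guest_io(src, dirty):
        if next(writes) < 5:
            src[0] = "new"
            dirty.add(0)

    src, dst, dirty = list("abcd"), [None] * 4, set()
    storage_vmotion_ish(src, dst, dirty, guest_io, final_pass_limit=0)
    assert dst == src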

I can't imagine they'd want to impose filesystem requirements - it's much simpler to assume you're just looking at a uniform array of blocks than to have to care about the structure on top.



