The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines
We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. Our method for replicating VM execution is similar to that described in a paper by Bressoud et al., but we have made a number of significant design changes that greatly improve performance. In addition, an easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide an evaluation of our performance for both micro-benchmarks and real applications.