Note: VMmark 1.x is now permanently retired. It will remain available for download for academic purposes only. The VMmark 1.x results page will remain available for reference.
Get an accurate measurement of application performance in virtualized environments with VMmark, the industry's first virtualization benchmark for x86-based computers. For academic use only.
For the latest version go to: VMmark Version 2
What is VMmark?
VMmark is the first benchmark that was designed specifically to quantify and measure the performance of virtualized environments. It features a novel tile-based scheme for measuring the scalability of consolidated workloads and provides a consistent methodology that captures both the overall scalability and individual application performance.
The VMmark benchmark is built on our expertise in virtualization performance and incorporates popular workloads from application categories most commonly represented in customer data centers.
Why is there a need for a new benchmark?
Traditional server benchmarks that exist today were developed with neither virtual machines nor server consolidation in mind and focus on a single workload per server. These benchmarks do not capture the system behavior induced by multiple virtual machines and fail to provide sufficient insight into the scalability of virtual environments supporting multiple simultaneous workloads on the same server.
Organizations implementing or evaluating virtualization platforms today also need a more realistic and specialized benchmark to help them compare performance and scalability of different virtualization platforms, make appropriate hardware choices and measure platform performance on an ongoing basis.
Clearly, a more sophisticated approach is required to quantify a virtualization environment's performance and develop meaningful and precise metrics in order to effectively compare the suitability and performance of different hardware platforms for virtual environments. Also, there is a need for a common workload and methodology for virtualized systems so that benchmark results can be compared across different virtualization platforms.
What are some specific requirements for developing such a benchmark?
Besides capturing the key performance characteristics of virtual systems, an appropriate virtual machine benchmark must employ realistic, diverse workloads running on multiple operating systems. It must also define a single, easy-to-understand metric while remaining representative of various end-user environments. The benchmark specification needs to be platform neutral and should provide a methodical way to measure scalability, so that the same benchmark can be used for small servers as well as larger servers from different hardware vendors.
Why did VMware develop VMmark?
VMware recognized the need for a virtualization benchmark early on, as more and more customers asked for metrics to compare the hardware platforms and configurations on which to run their virtualized environments.
VMmark provides a standardized way to compare platforms that customers have come to expect from enterprise software.
Will VMmark be an industry standard? If so, what is VMware doing towards this goal?
VMware is actively working on open standards on virtualization benchmarks. In October 2006, SPEC formed a working group to develop a standard benchmark for measuring virtualization performance. This working group was formed at the request of VMware. By March 2007 the working group had agreed on the design goals and project plan, and was able to graduate to a subcommittee. Paula Smith of VMware was the chairperson of the working group and continues to chair the subcommittee. We are an active participant in the subcommittee, along with many of our major partners and a few of our competitors. Current participants include: AMD, Dell, Fujitsu Siemens, Hewlett-Packard, Intel, IBM, Microsoft, Red Hat, Sun Microsystems, SWsoft and VMware. Additional information on the subcommittee can be found at: http://www.spec.org/specvirtualization/
How was VMmark developed?
Nearly two years of engineering effort went into the design and implementation of the benchmark, culminating in a private beta release in December 2006 and the current public beta (launched in July 2007) as part of VMware’s normal product release cycle.
In the effort to build a reliable and robust benchmark that truly represents customer environments, VMware has taken into account extensive survey data from its customers to understand what types of applications and configurations are typically run in virtualized environments. VMware has also worked closely with its partners to design and implement the benchmark across various software and hardware platforms. Throughout the course of the benchmark development, VMware has also evaluated numerous workloads and run hundreds of experiments to make sure that the benchmark is reliable and robust.
What is a tile?
A tile is a collection of six diverse workloads, each concurrently executing specific software. Each workload runs in its own virtual machine, on one of two operating systems, and executes applications commonly found in datacenters. A single tile includes a web server, file server, mail server, database server, Java server, and an idle machine.
Each virtual machine in a tile is tuned to use only a fraction of the system's total resources. As a tile, the aggregate of all six workloads normally utilizes less than the full capacity of modern servers. Therefore, the complete saturation of a system's resources and accurate measurement of server performance with VMmark require the execution of multiple tiles simultaneously.
How does VMmark work?
VMmark is designed as a tile-based benchmark consisting of a diverse set of workloads commonly found in the datacenter, including database server, file server, web server, and Java server. The workloads comprising each tile are run simultaneously in separate virtual machines at load levels that are typical of virtualized environments. The performance of each workload is measured and then combined with the other workloads to form the score for the individual tile. Multiple tiles can be run simultaneously to increase the overall score.
This approach allows smaller increases in system performance to be reflected by increased scores in a single tile, and larger gains in system capacity to be captured by adding tiles. (Future work will present data to demonstrate the ability of multiple tiles to measure performance of larger multiprocessor systems using a well-defined reference score.)
Each workload within a VMmark tile is constrained to execute at less than full utilization of its virtual machine. However, the performance of each workload can vary to a degree with the speed and capabilities of the underlying system. For instance, disk-centric workloads might respond to the addition of a fast disk array with a more favorable score. These variations can capture system improvements that do not warrant the addition of another tile. The workload throttling, however, forces the use of additional tiles for large jumps in system performance. When the number of tiles is increased, workloads in existing tiles might measure lower performance. However, if the system has not been overcommitted, the aggregate score, including the new tile, should increase. The result is a flexible benchmark metric that provides a relative measure of the number of workloads that can be supported by a particular system as well as the overall performance level within the virtual machines.
Who will use VMmark?
VMmark was developed as a useful tool for hardware vendors, system integrators, and customers to evaluate the performance of their systems. Many customers will not run the benchmark themselves, but rather rely on published VMmark scores from their hardware vendors to make purchasing and configuration decisions for their virtualization infrastructure.
What are the use cases for VMmark?
The main use case for VMmark is to compare the performance of different hardware platforms and configurations. Organizations implementing or evaluating virtualization platforms today will use VMmark for comparing the performance and scalability of different virtualization platforms, making appropriate hardware choices, and measuring platform performance on an ongoing basis.
It is also important to note that VMmark is neither a capacity planning tool nor a sizing tool. It does not provide deployment guidelines for specific applications. Rather VMmark is meant to be representative of a general-purpose virtualization environment. The virtual machine configurations and the software stacks inside the virtual machines are fixed as part of the benchmark specification. Recommendations derived from VMmark results will capture many common cases; however, specialized scenarios will likely require individual measurement.
What are the benefits of VMmark?
With VMmark, organizations now have a robust and reliable benchmark that captures the key performance characteristics of virtual systems, is representative of end-user environments running multiple workloads, is platform neutral, and provides a methodical way to measure scalability so that the same benchmark can be used across different hardware platforms.
With VMmark, organizations now finally have a virtualization benchmark that works. With VMmark, organizations can compare performance and scalability of different virtualization platforms, make appropriate hardware choices and monitor virtual machine performance on an ongoing basis.
How do I interpret a VMmark score?
A VMmark score is a measure of the performance of both the hardware and virtualization layers of a virtualization platform. Each score represents the performance relative to a fixed reference platform. Though the reference platform is from a previous hardware generation, making comparisons between it and newer systems not very meaningful, its use allows for easy comparisons between various contemporary platforms and configurations.
A score is obtained by measuring the aggregate throughput achieved by multiple workloads executing simultaneously on the virtualization platform. A set of six specific workloads, each in its own virtual machine, are run for a specific length of time. These six workload virtual machines are collectively defined as a VMmark tile.
During a VMmark run, each individual workload generates a raw throughput metric -- for example, the throughput of the database workload is measured in transactions per minute. Upon completion of a run, each of these raw metrics is normalized with respect to the reference platform, and the geometric mean of the normalized individual scores is then computed. The resulting score is a measure of the throughput of the tested platform relative to the reference platform.
In addition to this score, each VMmark result also includes the number of VMmark tiles used in the benchmark run. With increasing system resources (for example, more CPU cores), multiple VMmark tiles (that is, complete sets of the six workload virtual machines) can be run simultaneously in order to fully utilize a virtualization platform. After the score for each tile is calculated, the individual tile scores are added together to produce the VMmark score.
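The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VMware's scoring harness: the function names, the workload keys, and every throughput number are hypothetical, and the idle machine is left out because this page does not say how it is scored. The actual reference values and rules are defined in the VMmark Benchmarking Guide.

```python
import math

def tile_score(raw, reference):
    """Score one tile: normalize each workload's raw throughput to the
    reference platform, then take the geometric mean of the ratios."""
    ratios = [raw[name] / reference[name] for name in reference]
    # Geometric mean computed via logarithms; all ratios must be positive.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def vmmark_score(tiles, reference):
    """Overall score: the sum of the individual tile scores."""
    return sum(tile_score(raw, reference) for raw in tiles)

# Hypothetical reference-platform throughputs (made-up numbers and units).
reference = {"web": 1000.0, "file": 20.0, "mail": 300.0,
             "database": 1500.0, "java": 18000.0}

# Hypothetical measurements from a two-tile run on a tested platform.
tile_a = {"web": 1200.0, "file": 22.0, "mail": 330.0,
          "database": 1800.0, "java": 19800.0}
tile_b = {"web": 1100.0, "file": 21.0, "mail": 315.0,
          "database": 1650.0, "java": 18900.0}

print(f"tile A: {tile_score(tile_a, reference):.3f}")
print(f"tile B: {tile_score(tile_b, reference):.3f}")
print(f"total ({2} tiles): {vmmark_score([tile_a, tile_b], reference):.3f}")
```

Because the geometric mean is used, a tile that exactly matches the reference platform on every workload scores 1.0, and no single workload with a large raw number can dominate the tile score.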
A VMmark full disclosure report also includes the raw and normalized results for each underlying workload as well as complete details of the virtualization platform configuration. In some cases, studying the workload metrics along with the platform configuration can provide insight into system performance and scaling.
For a more detailed description of the benchmark scoring methodology see the VMmark Benchmarking Guide.
How do I compare VMmark scores across different virtualization platforms?
A higher VMmark score implies that a virtualization platform is capable of sustaining greater throughput in a mixed workload consolidation environment. A larger number of VMmark tiles used to generate the benchmark means that the platform supported more virtual machines during the benchmark run. Typically, a higher benchmark score requires a higher number of tiles.
If two different virtualization platforms achieve similar VMmark scores with a different number of tiles, the score with the lower tile count is generally preferred. The higher tile count could be a sign that the underlying hardware resources were not properly balanced. Studying the individual workload metrics is suggested in these cases.
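The comparison rule above can be expressed as a small helper. This is a rough sketch only: the `VMmarkResult` type, the `preferred` function, and the 2% tolerance used to decide that two scores are "similar" are all assumptions made for illustration, since VMmark does not define such a threshold here, and the sample results are invented.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class VMmarkResult:
    platform: str   # label for the tested configuration (hypothetical)
    score: float    # VMmark score
    tiles: int      # number of tiles used in the run

def preferred(a, b, rel_tol=0.02):
    """Pick the generally preferred result of two.

    If the scores are similar (within rel_tol, an assumed threshold),
    the result with fewer tiles wins; otherwise the higher score wins.
    """
    if math.isclose(a.score, b.score, rel_tol=rel_tol):
        if a.tiles != b.tiles:
            return a if a.tiles < b.tiles else b
    return a if a.score >= b.score else b

# Invented results for illustration only.
x = VMmarkResult("Platform X", score=10.0, tiles=2)
y = VMmarkResult("Platform Y", score=10.1, tiles=3)
z = VMmarkResult("Platform Z", score=14.0, tiles=3)

print(preferred(x, y).platform)  # similar scores, X used fewer tiles
print(preferred(x, z).platform)  # Z has a clearly higher score
```

A helper like this only flags which result to look at first; as noted above, the individual workload metrics should still be studied when tile counts differ.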
How is VMmark version 1.1 different from version 1.0?
To address the growing prevalence of 64-bit applications and operating systems, the Java server, database server, and web server workloads within the tile are 64-bit in VMmark 1.1. The mail server, file server, and standby server remain 32-bit and are unchanged from VMmark 1.0.
Are VMmark 1.1 results comparable to VMmark 1.0 results?
Yes, the results are directly comparable. The underlying virtual hardware definitions and load levels for each workload have not changed in VMmark 1.1.