Archive for the ‘Performance’ Category

Three secrets of the Uber-Virtualized

January 28th, 2010 John Gannon

In October 2009, Gartner estimated that only 16% of workloads worldwide were running in virtual machines, although tremendous growth is expected in the coming years.  Not surprisingly, and roughly in line with Gartner’s estimates, most customers are about 20% to 30% virtualized, with ambitious plans for growth in the coming year.

However, some organizations are outliers.  They have virtualized the majority of their IT environment and are seeing benefits above and beyond the typical server consolidation and disaster recovery use cases.

I like to call these folks the uber-virtualized, and in this post I’ll discuss some of the best practices we’ve learned from them!

1)  Pay attention to your storage environment, because there is a good chance it’s where the bottleneck lives!

The fear of storage bottlenecks keeps the uber-virtualized up at night.  Virtualizing most of your IT environment puts additional stress on your SAN because of all the virtual disks stored there.  Rather than throwing storage capacity at the problem (at additional cost), they spend much time and effort poring over storage array and VirtualCenter data, trying to find optimization opportunities.

There are certainly some monitoring tools on the market which can aid in this process by gathering utilization and performance data from hosts and storage arrays.  However, these tools (like VirtualCenter) leave the administrator to make the final decision about how to rebalance the environment and mitigate the risk of storage bottlenecks.  Fortunately, the uber-virtualized have been working with VMware technology for many years, and are often able to make the right decision based on their experience.

VMware has also recognized that storage challenges can really hurt virtualized IT deployments, and has responded by developing new technology like IO DRS.  This is a good first step, although the VMware administrator will need the experience to recognize the proper thresholds to configure to trigger a migration.  With 10 or even 100 VMs, this is fairly simple to do.  However, with hundreds or thousands of VMs in an environment that’s tightly managed (50%+ utilization), a deeper analysis that looks at all resources together (CPU, memory, and I/O) is needed before making workload balancing decisions.  Otherwise you’re risking performance problems and downtime.

2) Are you (CPU) Ready?

CPU Ready is one of the key parameters that the most experienced VMware administrators examine when they see a performance problem.  In fact, it is often the first thing they’ll check when debugging.  Learn to love this statistic and what it means, because it can help you identify virtual machines that may be oversized and that cause your environment to perform poorly.

(By the way, here is a nice PowerShell script that will grab CPU Ready stats for all of your VMs!)
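
If you collect the raw counter rather than a ready-made percentage, a minimal sketch of the conversion may help. It assumes the ready value is reported as a summation in milliseconds over a sampling interval (20 seconds is assumed here; check the interval your tool actually uses). The function names, the sample VM names and the 5% flagging threshold are illustrative assumptions, not VMware recommendations.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU Ready summation (milliseconds) into a percentage of the interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms / (interval_s * 1000.0) * 100.0


def flag_high_ready(samples: dict[str, float],
                    threshold_pct: float = 5.0,
                    interval_s: float = 20.0) -> list[str]:
    """Return the names of VMs whose ready percentage exceeds the assumed threshold."""
    return sorted(
        name for name, ready_ms in samples.items()
        if cpu_ready_percent(ready_ms, interval_s) > threshold_pct
    )


if __name__ == "__main__":
    # Hypothetical per-VM ready values (ms) taken from one 20-second sample.
    samples = {"db01": 2400.0, "web01": 300.0, "print01": 50.0}
    for vm, ms in samples.items():
        print(f"{vm}: {cpu_ready_percent(ms):.2f}% ready")
    print("Worth a closer look:", flag_high_ready(samples))
```

A VM that shows up in the flagged list is a candidate for the "oversized" check described above, for example by comparing its vCPU count with its actual CPU usage.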

3) Automate lightly, young Padawan.

One thing that surprised me in talking to the uber-virtualized is that some are skeptical about the use of automation like DRS in their environments.   I thought everyone would be running DRS in fully automated mode, and have their feet up on the couch at home while drinking a beer since their environment was managing itself :)   But there were some organizations that felt that DRS didn’t give them everything they needed (particularly in the IO department).  It will be interesting to see how IO DRS, once released, will address some of these concerns.

Are you one of the uber-virtualized?  Care to share any of your secrets, tips, or tricks?  Please feel free to leave them in the comments; your fellow VMware administrators and architects will appreciate it!

Two common VMware CPU performance problems and solutions

January 11th, 2010 John Gannon

This post is the first in a series about identifying and solving VMware-related performance problems.  In this post, we’ll briefly describe a couple of common VMware CPU performance problems and their solutions.  These are problems and solutions we’ve heard repeatedly from customers and end users.

PROBLEM

Co-Scheduling CPU Fragmentation

DESCRIPTION

vSMP virtual machines suffer long delays and throughput degradation

SYMPTOMS

Excessive ready counter; vSMP performance metrics down

RESOLUTION

Reduce CPU load by moving VMs, or reconfigure overallocated vSMP virtual machines with fewer virtual processors

PROBLEM

Interrupts

DESCRIPTION

A VM generates high interrupt rates hogging the CPU

SYMPTOMS

Long waits in ready queue for single vCPU

RESOLUTION

Reduce CPU load by moving VMs, or reconfigure overallocated vSMP virtual machines with fewer virtual processors

Please let us know if you have your own problems, solutions, and best practices to add.

Categories: Performance, Resources

Where In The World is The Virtualized IO Bottleneck? (II)

November 2nd, 2009 vmturbo

This post continues part I to consider two additional sources of potential IO bottlenecks in virtualized environments: randomization of access and microbursting.

RANDOMIZATION OF STORAGE ACCESS

Storage performance can vary by orders of magnitude between sequential and random access. Sequential access rates are bound by transfer rates. For example, a storage system supporting a 500MBps transfer rate can handle a sequential stream of 8KB records at some 500,000/8 ≈ 62,500 IOps (I/O operations per second).  In contrast, random access rates are bound by the average seek time. For example, a storage system with an average seek time of 5ms can handle only 1/0.005 = 200 purely random IOps, a mere 0.3% of the sequential access rate above.
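
The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the model is deliberately crude: it uses the same decimal units as the text (1 MB = 1000 KB) and ignores rotational latency, caching and queueing.

```python
def sequential_iops(transfer_mb_per_s: float, record_kb: float) -> float:
    """Sequential rate is bound by the transfer rate: bytes per second / record size."""
    return transfer_mb_per_s * 1000.0 / record_kb


def random_iops(avg_seek_s: float) -> float:
    """Purely random rate is bound by the average seek time: one seek per operation."""
    return 1.0 / avg_seek_s


if __name__ == "__main__":
    seq = sequential_iops(transfer_mb_per_s=500.0, record_kb=8.0)
    rnd = random_iops(avg_seek_s=0.005)
    print(f"sequential: {seq:,.0f} IOps")
    print(f"random:     {rnd:,.0f} IOps")
    print(f"random / sequential: {rnd / seq:.2%}")
```

Changing `avg_seek_s` shows how strongly the random-access bound depends on seek time alone.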

Databases and file systems have thus been designed to optimize access rates through sequential organization of stored data. Storage arrays, likewise, incorporate sophisticated scheduling mechanisms to minimize the penalties of random access and optimize sequential access. In particular, I/O operations are queued and scheduled to minimize access time.

Figure 2 of part I, repeated below, depicts a virtualized storage I/O pipe. A central function of the virtualized pipe is to consolidate I/O workloads. The consolidated flow interleaves the I/O operations of different VMs. Thus, even if each VM generates a stream of perfectly sequential access requests, the consolidated stream may require the storage system to handle purely random access.

Figure 3: Interleaving of VM I/O Streams

To illustrate the effect of interleaving, consider an idealized worst-case scenario of 8 VMs, as in the figure. Each VM generates a perfectly sequential stream of I/O operations. These I/O operations are perfectly interleaved by the virtualized I/O pipes to target the same spindle.  The storage system will see these interleaved requests as purely random accesses. This penalizes the I/O operations with both random-access delays and queueing delays caused by interfering streams.
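
The worst case described above can be made concrete with a toy simulation. Everything here is an illustrative assumption: block addresses are plain integers, each VM's virtual disk starts at a widely separated offset, and the pipe interleaves requests in strict round-robin order.

```python
def sequential_stream(start_block: int, length: int) -> list[int]:
    """Block addresses of one VM reading its disk region strictly in order."""
    return list(range(start_block, start_block + length))


def interleave(streams: list[list[int]]) -> list[int]:
    """Round-robin merge of equal-length streams, one request per VM per turn."""
    return [block for turn in zip(*streams) for block in turn]


def sequential_fraction(blocks: list[int]) -> float:
    """Fraction of requests that land on the block right after the previous request."""
    if len(blocks) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1)
    return hits / (len(blocks) - 1)


if __name__ == "__main__":
    # 8 VMs, each with its own region of the spindle, 1000 requests per VM.
    streams = [sequential_stream(vm * 1_000_000, 1000) for vm in range(8)]
    merged = interleave(streams)
    print("per-VM sequential fraction:  ", sequential_fraction(streams[0]))
    print("consolidated sequential frac:", sequential_fraction(merged))
```

Each individual stream is fully sequential, yet no request in the merged stream follows its predecessor on disk, which is what the array sees.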

More generally, I/O workload consolidation can randomize sequential storage accesses by interleaving them. The degree of randomization depends on a large number of factors, ranging from the statistics of I/O workloads of VMs, to the queueing and scheduling mechanisms of the storage array.

One can reduce interleaving effects by carefully separating competing I/O workloads to target different spindles.  This requires careful tuning and allocation of I/O traffic among different VMFS volumes, LUNs and hypervisors.

Alternatively, one can eliminate the impact of randomization by exploiting emerging Enterprise Flash Drive (EFD) storage systems. Flash storage can reduce seek time to the sub-ms range, e.g., 0.1ms. At 0.1ms average seek time, the rate of random accesses to storage is 1/0.0001 = 10,000 IOps, which is comparable in magnitude to the IOps rates of purely sequential access. Indeed, in reported performance experiments, an ESX server sustained over 350,000 random-access IOps against EFD storage arrays.

MICROBURSTING

A multi-Gbps I/O link can generate large bursts of traffic during short durations. For example, an 8Gbps link can transmit 125,000 I/O operations of 8KB per second. A microburst of 4ms over this link may generate some 500 I/O operations. Such microbursts may exceed the buffer capacity along the virtualized I/O pipe, resulting in buffer overflows, losses and increased latency.

Put differently, an I/O pipe of 8Gbps, with 10ms end-to-end latency, may need to store a (bandwidth)x(delay) product of 1250 I/O operations in its buffers. Furthermore, these I/O operations may not be distributed uniformly through the buffers, but concentrate at some bottleneck links. Microbursts may saturate these bottleneck queues, resulting in losses.
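
The burst and bandwidth-delay figures above follow from a few lines of arithmetic, sketched here. The function names are hypothetical, and the calculation follows the text in using decimal units (8KB = 8000 bytes) and ignoring protocol and encoding overhead on the link.

```python
def link_iops(link_gbps: float, io_kb: float) -> float:
    """I/O operations per second a link can carry at full rate."""
    bytes_per_second = link_gbps * 1e9 / 8.0
    return bytes_per_second / (io_kb * 1000.0)


def ios_in_flight(link_gbps: float, io_kb: float, duration_ms: float) -> float:
    """I/O operations emitted during a burst, or held in the pipe over its latency."""
    return link_iops(link_gbps, io_kb) * duration_ms / 1000.0


if __name__ == "__main__":
    print(f"link rate:          {link_iops(8, 8):,.0f} IOps")
    print(f"4 ms microburst:    {ios_in_flight(8, 8, 4):,.0f} operations")
    print(f"10 ms bandwidth x delay: {ios_in_flight(8, 8, 10):,.0f} operations")
```

Comparing these counts with the queue depths configured along the pipe gives a first, rough idea of which buffers a burst could saturate.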

Microbursting has been known to disrupt traffic in TCP/IP networks (e.g., see microbursting impact on financial networks).  Advanced routers deploy traffic shapers to detect and manage microbursting by spreading bursts.  Detecting microbursting may be challenging, as standard tools typically monitor averages over time periods much longer than a burst and may miss the bursts.
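
To see why long averaging windows hide bursts, consider a small sketch over a synthetic trace. The trace (a light background load with a single 4ms burst at full link rate) and the function names are invented for illustration; real traces would come from whatever fine-grained instrumentation is available.

```python
def peak_iops(per_ms_counts: list[int], window_ms: int) -> float:
    """Highest I/O rate (operations per second) seen in any aligned window."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    window_totals = [
        sum(per_ms_counts[i:i + window_ms])
        for i in range(0, len(per_ms_counts), window_ms)
    ]
    return max(window_totals) / window_ms * 1000.0


def synthetic_trace() -> list[int]:
    """One second of traffic: 5 I/Os per ms, plus a 4 ms burst of 125 I/Os per ms."""
    trace = [5] * 1000
    for ms in range(500, 504):
        trace[ms] = 125
    return trace


if __name__ == "__main__":
    trace = synthetic_trace()
    print(f"1 s window peak: {peak_iops(trace, 1000):,.0f} IOps")
    print(f"4 ms window peak: {peak_iops(trace, 4):,.0f} IOps")
```

With a one-second window the burst adds only a few hundred IOps to the average, while a 4ms window reports the link running flat out.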

A recent article by Chad Sakac provides an excellent analysis of the microbursting behavior of storage I/O in virtualization systems. An interesting question is which buffers along the I/O pipe absorb the microbursts and overflow: the array, fabric or hypervisor queues?  The answer, of course, depends on the specific configuration and buffer sizes. A subsequent article reports measurements of overflows of the hypervisor’s LUN queues; for the scenario considered, these overflows were sufficiently rare to be negligible.

Practically speaking, administrators must protect high-speed virtualized I/O pipes against potential microbursting. In particular, they need to configure buffers along the pipe, detect microbursts and the buffers they saturate, and shift VMs and I/O traffic to reduce the pressure on these buffers.

CONCLUSIONS

Virtualization of I/O pipes can give rise to complex potential bottlenecks through interference among consolidated I/O workloads. Interference arises in several forms: (a) competition among traffic streams over shared resources along the I/O pipe; (b) randomization of interleaved sequential access; and (c) condensation of traffic into microbursts.

Emerging NPIV technologies may ease traffic interference by extending FC protocols to support end-to-end flow control between guest OSes and storage arrays. This will allow the flow control and traffic management mechanisms of FC to regulate and reduce traffic interference.  Emerging EFD storage technologies accelerate random access and can thus resolve the randomization of consolidated I/O workloads. Managing microbursting may become important as higher bandwidth I/O infrastructures are deployed. This may require bandwidth management technologies analogous to those used in high-speed TCP/IP networks.

Regardless of these advances, virtualization system administrators are likely to remain tasked with  I/O performance management. This presents complex challenges, not the least of which is coordinating management of I/O intensive applications and traffic among virtualization administrators, storage administrators and applications administrators.

Categories: Performance

Where in the world is the virtual I/O bottleneck? (I)

October 22nd, 2009 vmturbo

This two-part article considers storage IO bottlenecks in virtualization systems.

Background: Storage IO flows

Figure 1:  Physical IO Pipes

Figure 1, above, depicts the storage IO pipe through a vanilla physical infrastructure. IO operations flow from the source OS drivers, on the left, through the Host Bus Adapter (HBA) and the SAN fabric, to the Host Interface Card (HIC) of the storage array, on the right, and are then delivered to the target Storage Processor (SP) for processing. The HBA, fabric and HIC use Fibre Channel (FC) protocols to assure reliable and efficient delivery of IO operations.

IO traffic through the pipe traverses a large number of processing elements where it competes with traffic of other pipes and is buffered until processed. This can give rise to congestion, where buffers overflow and drop IO frames. The FC protocols detect these losses and retransmit the frames. This results in reduced throughput and increased latency, which impairs the respective applications.

The FC protocols thus incorporate careful flow control mechanisms to avoid buffer overflows by limiting traffic along both hop-by-hop links and end-to-end connections. These mechanisms control traffic levels to minimize interference among competing workloads of different channels, assure buffer availability along the pipes and enable administrators to balance IO workloads through the fabric and arrays.

Virtualization is a game changer.

Consider a generic scenario of IO flows through virtualization infrastructures, depicted in figure 2 below.

Figure 2:  Virtualized IO Pipes

The virtualized IO pipes from VMs to the array differ from the physical pipes depicted in figure 1 in two ways:

(a)   They traverse additional  hypervisor “IO links”  and queues between the vHBA and HBA; and

(b)  They share common channels between VMFS,  HBA and LUNs

These seemingly innocuous distinctions have a significant impact:

(1)  FC flow control does not extend to the hypervisor’s “IO links” between the vHBA and HBA. Link-level flow control over these links, and end-to-end flow control between the vHBA and the array, shift from automated, adaptive, coordinated channel protocols to hypervisor management by virtualization administrators, in coordination with storage and applications administrators.

(2)  IO flows of a given VM lose the protections of channel flow-control mechanisms and may be disrupted by IO flows of other VMs sharing their HBA, channel and LUN.

(3)  Storage array performance may also be disrupted through randomization of access by interfering IO flows.

(4)  Elusive bottlenecks may emerge due to short bursts (microbursting), which are challenging to detect, isolate and handle.

In what follows we consider the first two of these effects in detail, leaving the last two for the second part of this article.

The Hypervisor Shifts Protocol Functions To Management Responsibilities

Consider first the role of the hypervisor in handling IO flows. The hypervisor “IO links” extend the channels of the HBA with new processing and buffers.  Traffic along these links cannot be flow-controlled by the channel protocols. The hypervisor thus requires flow-control mechanisms to prevent buffer overflows of its links, as well as of end-to-end links. VMware, for example, sets strict limits on the number of IO operations that can be buffered at the hypervisor (typically 32) and requires respective configurations of the VMs (see this article or this one about storage queues and performance).
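
One way to reason about such a limit is Little's law: with at most Q operations outstanding and an average service latency of L, sustained throughput cannot exceed Q/L. The sketch below applies it. The queue depth of 32 comes from the text; the latency and target figures, and the function names, are illustrative assumptions.

```python
import math


def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on sustained IOps for a given number of outstanding operations."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return queue_depth * 1000.0 / latency_ms


def required_queue_depth(target_iops: float, latency_ms: float) -> int:
    """Smallest number of outstanding operations that can sustain the target rate."""
    return math.ceil(target_iops * latency_ms / 1000.0)


if __name__ == "__main__":
    for latency_ms in (1.0, 5.0, 20.0):
        print(f"depth 32 at {latency_ms:>4} ms latency: "
              f"at most {max_iops(32, latency_ms):,.0f} IOps")
    print("depth needed for 10,000 IOps at 5 ms:",
          required_queue_depth(10_000, 5.0))
```

The point of the exercise is that a fixed queue limit turns array latency directly into a throughput ceiling, which is why the limit has to be configured with the workload in mind.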

This converts flow control functions from automated, adaptive infrastructure protocols to a management function to be handled by virtualization administrators. Furthermore, channel flow control protocols provide end-to-end adaptive traffic control, coordinating flows along intermediate links to avoid bottlenecks. The hypervisor links do not extend this end-to-end control, increasing the possibility of uncoordinated flows and bottleneck formation.

Virtualization administrators are thus required to monitor IO traffic flows to detect, analyze and handle disruptions and coordinate these with storage and applications administrators.  VMware provides rich instrumentation to support this monitoring (see this article about storage analysis and monitoring and this article about vscsi stats). However, the tasks of monitoring this data, analyzing it, detecting IO disruptions and resolving them can be very challenging and require an intimate understanding of storage IO flows and the underlying infrastructure’s operations.

Could one restore flow control over the hypervisor’s IO links to automated protocols?  Recent extensions of the FC standards (discussed below) permit overlays of virtual FC between VMs and the array. These extensions permit channel protocols to protect end-to-end flows from the vHBA to the LUN. Indeed, vSphere supports such channels. However, this requires the raw storage access provided by RDM, which means one cannot use the storage semantics and rich services of VMFS.

Are there other alternatives to restore flow control to automated mechanisms, while preserving rich hypervisor services as provided by VMFS?  This question will be considered in future blogs.

We now turn to the second and more challenging problem of virtualized IO pipes.

IO Flows Can Disrupt Each Other

Consider the IO flows of multiple VMs sharing a VMFS and an HBA, as depicted in figure 2.  Suppose these VMs access different targets and retrieve large amounts of data to process. The storage arrays may inject these independent IO flows into the fabric over several ports and storage processors. The aggregate throughput of these flows may far exceed the capacity of the fabric port attached to the HBA. This will result in buffer overflows and loss, triggering retransmissions and increased latency.
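
A back-of-the-envelope sketch shows how quickly such oversubscription can exhaust a buffer. All figures here are hypothetical (four array-side flows of 400 MB/s each, an 800 MB/s port towards the HBA, 8 MB of buffering in front of it), and the model ignores flow control entirely, so it describes only the unprotected case.

```python
from typing import Optional


def time_to_overflow_ms(flows_mb_per_s: list[float],
                        port_mb_per_s: float,
                        buffer_mb: float) -> Optional[float]:
    """Milliseconds until the buffer in front of the port fills, or None if it never does."""
    excess = sum(flows_mb_per_s) - port_mb_per_s
    if excess <= 0:
        return None  # the port keeps up; the buffer does not grow
    return buffer_mb / excess * 1000.0


if __name__ == "__main__":
    flows = [400.0, 400.0, 400.0, 400.0]
    t = time_to_overflow_ms(flows, port_mb_per_s=800.0, buffer_mb=8.0)
    print(f"aggregate demand: {sum(flows):,.0f} MB/s into an 800 MB/s port")
    print(f"buffer overflows after about {t:.0f} ms")
```

Under these assumptions the buffer lasts on the order of milliseconds, far shorter than the sampling interval of most monitoring tools.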

A physical infrastructure, as depicted in figure 1, avoids such problems by dedicating physical capacity and carefully tuning IO  workloads to this capacity. In contrast, the application administrators of VMs cannot be aware of the IO workloads of other VMs, sharing the physical capacity with them.  For example, a database application may retrieve large tables to compute their join, while a security application may scan a VM storage for viruses. Interference presents a complex challenge when multiple IO-intensive applications share an HBA.

Why is interference in sharing an HBA harder to handle than interference in sharing CPUs? CPU resources are carefully scheduled by automated hypervisor mechanisms that adapt to instantaneous demands and provide guaranteed allocations.  In contrast, HBA resources are scheduled through loose mechanisms managed by administrators; these do not adapt to instantaneous traffic and do not provide guaranteed allocations.

Interference and disruptions can emerge through competitive sharing of memory resources, not just HBA. Consider a guest database server requiring physical memory to process large tables. The hypervisor may use ballooning to reclaim physical memory from other VMs and expand its physical memory pool. Now suppose the VMs releasing this physical memory require it back. The memory available to the database server will decline. The hypervisor may swap least-recently-used (LRU) pages of the database server to its swap area. The guest OS of the database server may, too, use an LRU algorithm to swap the same pages to its own swap area. This requires the hypervisor to swap the pages back to physical memory where they may be copied by the guest OS to its swap area. Such interleaved swapping and ballooning can significantly disrupt multiple VMs. The database server, in particular, may be unable to handle the bursts of IO  flows delivering the large tables.

One could, of course, pursue several measures to limit interference. For example, one may reduce interference over VMFS by dedicating a VMFS to IO-intensive applications (this, however, may create interference through competition of VMFS over memory resources). Similarly, one may cap IO throughput so that aggregate utilization does not exceed 30% of the HBA capacity. However, IO traffic is very bursty; even if traffic averages meet such pre-set limits, one cannot ignore the disruptive effects of bursts.

Needless to say, one can reduce interference by limiting consolidation ratios for IO-intensive applications. Alternatively, one can over-engineer the IO pathways and memory to minimize interference.  However, both approaches put in question the very reason to virtualize IO-intensive applications. Another alternative, usually pursued by administrators in virtualizing IO-intensive applications, is to consolidate such applications with workloads involving low IO demands, e.g., consolidate a database server with print servers and web servers.

Recent efforts by the T11 standards committee (the NPIV protocol) provide a promising alternative in enabling FC channels to be virtualized and extended from the vHBA to the LUN. This permits FC protocols to allocate end-to-end resources to these virtualized channels, control flows and minimize interference.  These mechanisms can dramatically simplify both the interference and flow control problems. However, there are two limiting factors in using them. First, one has to use RDM to support such virtualized FC channels and abandon the rich  services offered by VMFS. Second, in an environment where VMs can move, the resources allocated to virtualized channels will need to be adapted dynamically to handle redistribution of the IO workloads; this requires challenging management automation tools.

In conclusion, virtualization of IO flows, while seemingly involving trivial changes from physical infrastructures, introduces significant new complexities and potential disruptions of IO flows. Part II considers some additional such challenges of IO virtualization and possible directions to resolve them.

Categories: Performance

Should you pursue a VMware performance PhD?

October 9th, 2009 vmturbo

A recent article by David Vellante claims:

The fact is, most data center managers wouldn’t trust VMware to manage their Tier 1 applications because if something goes wrong performance-wise, you still need to roll in the VMware PhDs to solve it.

While such a statement can be controversial, it is difficult to ignore its valuable substance:

  1. Virtualization leads to novel complex performance problems.
  2. Managing these performance problems can be very challenging.
  3. This hinders virtualization of Tier 1 applications which can be very sensitive to performance problems.

In what follows we consider the first two claims.

A VM exports to its guest OS and applications the semantics of the underlying physical resources, but not the performance guarantees they provide. Indeed, an increase in consolidation ratios and utilization of the physical resources necessarily means an increase in competition among workloads over these resources. This competition, in turn, can breed complex interference patterns and performance problems.

Consider a sample problem scenario. An application administrator asks you to increase the CPU budget allocated to their VM to handle its growing workloads. You double the VM allocation from 2 vCPUs to 4. Surprisingly, the performance of the application degrades rather than improves.

You face a few challenging questions:

  1. What could be the causes for this performance paradox?
  2. What instrumentation should you monitor to analyze the root causes?
  3. How do you resolve the problem?

VMware provides helpful documentation to handle these challenges.  Guides to performance monitoring and troubleshooting describe CPU problems and can help you address the first question. There are also articles that discuss performance monitoring counters, esxtop metrics, and their diagnostic meaning. For example, you may see an excessive value of the %RDY counter, describing “percent time spent by a VM waiting for CPU(s) to become available”.

Now, why would the VM wait for CPUs for so long? This indicates competition with other VMs. But shouldn’t a 4-vCPU configuration win a larger competitive share than a 2-vCPU one? The answer to this question is provided in this article about SMP coscheduling mechanisms. These mechanisms seek to provide the VM the semantics of 4 vCPUs, which means finding 4 CPUs available at the same time. When CPU resources are under tight competition, the waiting periods for 4 CPUs to become available are longer than for 2, and this shows up in %RDY. Once this root cause has been determined, problem resolution is straightforward (e.g., free CPUs by shifting VMs to other resources).
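
A toy probability model illustrates the effect. It assumes strict co-scheduling (all vCPUs of a VM must be placed at once) and that each physical CPU is idle independently with the same probability at any scheduling instant. Both are simplifying assumptions made for this sketch; real hypervisor schedulers are considerably more sophisticated.

```python
from math import comb


def prob_at_least_idle(total_cpus: int, needed: int, p_idle: float) -> float:
    """Probability that at least `needed` of `total_cpus` physical CPUs are idle at once."""
    return sum(
        comb(total_cpus, j) * p_idle ** j * (1.0 - p_idle) ** (total_cpus - j)
        for j in range(needed, total_cpus + 1)
    )


if __name__ == "__main__":
    total_cpus, p_idle = 8, 0.3  # hypothetical host: 8 cores, each idle 30% of the time
    for vcpus in (1, 2, 4):
        p = prob_at_least_idle(total_cpus, vcpus, p_idle)
        print(f"{vcpus} vCPU(s): can be co-scheduled {p:.1%} of the time")
```

In this model the chance of finding four idle CPUs at once is several times lower than the chance of finding two, so the larger VM spends more time in the ready state even though it was given more vCPUs.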

This process is perhaps what David Vellante meant by “roll in the VMware PhDs”. Indeed, it requires intimate familiarity with the hypervisor mechanisms (e.g., coscheduling); understanding the meaning of performance instrumentation counters and their relationships to underlying performance behaviors; correlating the observed symptoms; analyzing the root cause; and handling it. Furthermore, some of these activities require tight collaboration between the virtualization administrators, application administrators and, possibly, the storage administrators.

To be fair, similar difficulties confronted the management of other emerging technologies. For example, in the early 90’s vendors of routers and LAN switches equipped them with management information bases (MIBs) involving thousands of cryptic counters, not just a few scores. Fortunately, a burgeoning management industry developed tools that quickly relieved network administrators of the need to earn network PhDs. These tools have enabled administrators to monitor behaviors in terms of manageable abstractions, rather than cryptic instrumentation; automate the analysis of the instrumentation data through smart algorithms; and simplify and streamline management actions.

The virtualization industry, likewise, needs to replace PhDs in VMware with automation and simplification tools that focus on smart analysis and decisions, rather than tracking counter values. Indeed, such simplification and automation of management is a precondition to empowering virtualization to offer OPEX scalability, much as it has been offering CAPEX scalability. I will consider these possibilities in future posts.
