Archive

Posts Tagged ‘VMware’

Three secrets of the Uber-Virtualized

January 28th, 2010 John Gannon Comments

In October 2009, Gartner estimated that only 16% of workloads worldwide are running in virtual machines, although tremendous growth is expected in the coming years.  Not surprisingly and roughly in line with Gartner’s estimates, most customers tend to be about 20% or 30% virtualized, with ambitious plans for growth in the coming year.

However, some organizations are outliers.  They have virtualized the majority of their IT environment and are seeing benefits above and beyond the typical server consolidation and disaster recovery use cases.

I like to call these folks the uber-virtualized, and in this post I’ll discuss some of the best practices we’ve learned from them!

1)  Pay attention to your storage environment, because there is a good chance it’s where the bottleneck lives!

The fear of storage bottlenecks keeps the uber-virtualized up at night.  When you’ve virtualized most of your IT environment, that is going to cause additional stress on your SAN because of all the virtual disks you’re storing there.  Rather than throwing storage capacity at the problem (at additional cost), much time and effort goes into poring over storage array and VirtualCenter data, trying to find optimization opportunities.

There are certainly some monitoring tools on the market which can aid in this process by gathering numerous bits of utilization and performance data from hosts and storage arrays.  However, these tools (just as VirtualCenter) leave the administrator to make the final decision about how to rebalance the environment and mitigate the risk of storage bottlenecks.  Fortunately, the uber-virtualized have been working with VMware technology for many years, and are often able to make the right decision based on their experience.

VMware has also recognized that storage challenges can really hurt virtualized IT deployments, and has responded by developing new technology like IO DRS.  This is a good first step, although the VMware administrator will need the experience to recognize the proper thresholds to configure to trigger a migration.  With 10 or even 100 VMs, this is fairly simple to do.  However, with hundreds or thousands of VMs in an environment that’s tightly managed (50%+ utilization), deeper analysis that looks at all resources together (CPU, memory, and I/O) is needed before making workload balancing decisions.  Otherwise you’re risking performance problems and downtime.

2) Are you (CPU) Ready?

CPU Ready is one of the key parameters that the most experienced VMware administrators examine when they see a performance problem.  In fact, it is often the first thing they’ll check when debugging.  Learn to love this statistic and what it means, because it can help you identify virtual machines that may be oversized and that cause your environment to perform poorly.

(By the way, here is a nice PowerShell script that will grab CPU Ready stats for all of your VMs!)
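If you pull the raw CPU Ready counter rather than the esxtop percentage, it arrives as a summation in milliseconds per sample interval, which is hard to read at a glance. Here is a minimal Python sketch of the usual conversion to a percentage. It assumes the 20-second interval of real-time charts; the VM names, sample values, and the 5% threshold are made up for illustration and are not official guidance.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (milliseconds accumulated over one
    sample interval) into an average percentage per vCPU."""
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms / (interval_s * 1000.0) / vcpus * 100.0


def flag_high_ready(samples, threshold_pct: float = 5.0, interval_s: float = 20.0):
    """Return the names of VMs whose per-vCPU ready percentage exceeds the threshold.

    The 5% default is an assumed rule of thumb, not a vendor-defined limit.
    """
    return [
        name
        for name, ready_ms, vcpus in samples
        if cpu_ready_percent(ready_ms, interval_s, vcpus) > threshold_pct
    ]


# Hypothetical samples: (VM name, ready milliseconds in one 20 s sample, vCPU count)
samples = [("web01", 400.0, 1), ("db01", 8000.0, 4), ("app01", 3000.0, 2)]

for name, ready_ms, vcpus in samples:
    print(f"{name}: {cpu_ready_percent(ready_ms, vcpus=vcpus):.1f}% ready per vCPU")
print("Worth a closer look:", flag_high_ready(samples))
```

Dividing by the vCPU count matters for vSMP machines, because the VM-level summation adds up the ready time of all of its vCPUs.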

3) Automate lightly, young Padawan.

One thing that surprised me in talking to the uber-virtualized is that some are skeptical about the use of automation like DRS in their environments.   I thought everyone would be running DRS in fully automated mode, and have their feet up on the couch at home while drinking a beer since their environment was managing itself :)   But there were some organizations that felt that DRS didn’t give them everything they needed (particularly in the IO department).  It will be interesting to see how IO DRS, once released, will address some of these concerns.

Are you one of the uber-virtualized?  Care to share any of your secrets, tips, or tricks?  Please feel free to leave them in the comments; your fellow VMware administrators and architects will appreciate it!


Two common VMware CPU performance problems and solutions

January 11th, 2010 John Gannon Comments

This post is the first in a series of posts about identifying and solving VMware related performance problems.  In this post, we’ll briefly describe a couple of common VMware CPU performance problems and their solutions.  These are problems and solutions we’ve heard repeatedly from customers and end users.

PROBLEM: Co-Scheduling CPU Fragmentation

DESCRIPTION: vSMP virtual machines suffer long delays and throughput degradation

SYMPTOMS: Excessive ready counter; vSMP performance metrics down

RESOLUTION: Reduce CPU load by moving VMs, or reconfigure overallocated vSMP virtual machines with fewer virtual processors

PROBLEM: Interrupts

DESCRIPTION: A VM generates high interrupt rates, hogging the CPU

SYMPTOMS: Long waits in ready queue for single vCPU

RESOLUTION: Reduce CPU load by moving VMs, or reconfigure overallocated vSMP virtual machines with fewer virtual processors

Please let us know if you have your own problems, solutions, and best practices to add.

Categories: Performance, Resources

5 tips to help you ride the next wave of server virtualization

Dustin Ray  "D-Ray" - surfing-cayuco...
Image by mikebaird via Flickr

After spending most of the last 6 years working in the virtualization space as a vendor (now VMTurbo and previously VMware), it’s funny to see how much things change as well as how much they stay the same.

One thing that has definitely not changed is that it is still very hard to move from what I like to call the ‘1st wave’ of virtualization (test & dev systems or low criticality production systems) to the ‘next wave’ (e.g. business critical production systems, heavily utilized databases).

On that note, I wanted to share a few practical tips that I’ve picked up along the way that have helped customers and partners keep the virtualization momentum going – and can help keep your virtualization momentum going in 2010!

1.  Use a disaster recovery or business continuity project to spur additional virtualization and consolidation.

Most companies have challenges around meeting disaster recovery and business continuity goals.  DR in the physical server world is tedious, error-prone, and in my experience mostly ineffective.  If there is a DR initiative at your company, it is a good bet that some of the problems you are trying to solve could be addressed by virtualizing those systems which  don’t have DR capability today or that have been problematic to recover using traditional physical server techniques.  Another tactic that I really like and have seen a few times is using your test and dev environment for DR.  Most virtual server environments I see still have plenty of capacity with which to handle a burst in the event a DR scenario occurred, so having a hybrid test/dev/DR environment is a great way to leverage an investment you’ve already made.

2.  Use a hardware refresh as an opportunity to virtualize.

Most IT shops refresh their server hardware every few years.  Why not use the refresh as an opportunity to remove hardware from your datacenter while adding flexibility to your operation?  Some of these systems may represent some of the more challenging applications to virtualize, and you may receive some resistance from application owners who are new to virtualization, but the CAPEX (and potentially OPEX) savings will be hard to ignore.

3.  Educate your peers.

Many companies do ‘lunch and learns’ or other informal gatherings where the virtualization team leads discuss how server virtualization works.  These gatherings are a great way to get your network, storage, and applications guys up to speed with your specific initiatives and virtualization technology, and get them talking and asking questions.  This education and relationship building will pay dividends when you start to move more critical applications into virtual machines and need to work closely with other groups within IT on capacity planning and troubleshooting.  Just ask the network guys: they’ve been getting blamed for years for problems that aren’t theirs!  Fortunately for them (and sadly for the virtualization administrator), the new whipping boy is the virtualization environment, and educating your peers can help mitigate this challenge.

4.   Connect with others in your city or industry who have successfully made it to the ‘next wave’ and gather best practices.

Certainly the web and social networking give us a great way to connect with virtualization experts, but there is still no substitute for face-to-face discussions or phone calls where you can ask questions directly to someone who has done it before.  If you know of another company in your industry or city who have already made it to the next wave of virtualization, and have learned the lessons (good and bad) along the way, reach out to them and see if they’d be open to a discussion.  I’d also recommend including when possible any key peers or managers in these calls and meetings.  This way, they have the opportunity to ask questions as well as internalize the information.

5. Measure and then publicize your success.

Don’t be afraid to let people in your organization know that you’ve saved money, increased responsiveness of IT to the business, and built a strategic, virtualized platform!  Keep an eye on your ‘before’ and ‘after’ metrics, and share them with management as well as folks on the business side.  Your results help build the confidence within your organization that you have a good handle on building and operating a virtualized environment, and are fully capable of onboarding additional applications and business units.

What did I miss?  Are there other techniques that have worked for you?  Please share them in the comments.


Where in the world is the virtual I/O bottleneck? (I)

October 22nd, 2009 vmturbo Comments

This two-part article considers storage IO bottlenecks in virtualized systems.

Background: Storage IO flows


Figure 1:  Physical IO Pipes

Figure 1, above, depicts the storage IO pipe through vanilla physical infrastructure. IO operations flow from the source OS drivers, on the left, through the Host Bus Adapter (HBA) and the SAN fabric, to the Host Interface Card (HIC) of the storage array, on the right, where they are delivered to the target Storage Processor (SP) for processing. The HBA, fabric and HIC use Fibre Channel (FC) protocols to assure reliable and efficient delivery of IO operations.

IO traffic through the pipe traverses a large number of processing elements where it competes with traffic of other pipes and is buffered until processed. This can give rise to congestion conditions, where buffers overflow and drop IO frames. The FC protocols detect these losses and retransmit the frames. This results in reduced throughput and increased latency, which impairs the affected applications.

The FC protocols thus incorporate careful flow control mechanisms to avoid buffer overflows by limiting traffic along both hop-by-hop links and end-to-end connections. These mechanisms control traffic levels to minimize interference among competing workloads of different channels, assure buffer availability along the pipes and enable administrators to balance IO workloads through the fabric and arrays.
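To make the hop-by-hop idea concrete, here is a toy Python sketch of credit-based flow control, the general scheme behind Fibre Channel's buffer-to-buffer credits. The class and method names are hypothetical and the model is heavily simplified (one link, no timing, no frame formats). It only illustrates the principle that a sender holding no credits must wait, so the receiver's buffers never overflow.

```python
from collections import deque


class CreditedLink:
    """Toy model of one link where the sender may transmit only while it holds credits.

    Each credit stands for one free buffer at the receiver.
    """

    def __init__(self, receiver_buffers: int):
        self.capacity = receiver_buffers
        self.credits = receiver_buffers   # sender starts with one credit per buffer
        self.rx_buffer = deque()

    def try_send(self, frame) -> bool:
        """Send a frame if a credit is available; otherwise the sender must wait."""
        if self.credits == 0:
            return False                  # nothing is dropped, the frame is simply held back
        self.credits -= 1
        self.rx_buffer.append(frame)
        return True

    def receiver_process_one(self):
        """Process one buffered frame and hand a credit back to the sender."""
        if not self.rx_buffer:
            return None
        frame = self.rx_buffer.popleft()
        self.credits += 1
        return frame


link = CreditedLink(receiver_buffers=2)
print([link.try_send(f) for f in ("f1", "f2", "f3")])  # third frame has to wait
link.receiver_process_one()                             # frees a buffer, returns a credit
print(link.try_send("f3"))                              # now the held frame can go
```

The point of the sketch is the invariant: the number of buffered frames can never exceed the number of buffers the receiver advertised.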

Virtualization is a game changer.

Consider a generic scenario of IO flows through virtualization infrastructures, depicted in figure 2 below.


Figure 2:  Virtualized IO Pipes

The virtualized IO pipes from VMs to the array differ from the physical pipes depicted in figure 1 in two ways:

(a)   They traverse additional hypervisor “IO links” and queues between the vHBA and HBA; and

(b)  They share common channels between VMFS, HBA and LUNs

These seemingly innocuous distinctions have a significant impact:

(1)  IO flow control by FC does not extend to the hypervisor’s “IO links” between the vHBA and HBA. Link-level flow control over these links, and end-to-end flow control between the vHBA and the array, shift from automated, adaptive, coordinated channel protocols to hypervisor management by virtualization administrators, in coordination with storage and application administrators.

(2)  IO flows of a given VM lose the protections of channel flow-control mechanisms and may be disrupted by IO flows of other VMs sharing their HBA, channel and LUN.

(3)  Storage array performance may also be disrupted, as interfering IO flows randomize access patterns.

(4)  Elusive bottlenecks may emerge due to short bursts (microbursting), which are challenging to detect, isolate and handle.

In what follows we consider the first two of these in detail, leaving the last two for the second part of this article.

The Hypervisor Shifts Protocol Functions To Management Responsibilities

Consider first the role of the hypervisor in handling IO flows. The hypervisor “IO links” extend the channels of the HBA with new processing and buffers.  Traffic along these links cannot be flow-controlled by the channel protocols. The hypervisor thus requires its own flow-control mechanisms to prevent buffer overflows on its links, as well as on end-to-end links. VMware, for example, sets strict limits on the number of IO operations that can be buffered at the hypervisor (typically 32) and requires corresponding configuration of the VMs (see this article or this one about storage queues and performance).
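One way to get a feel for what a queue limit such as 32 means in practice is Little's law: the number of outstanding IOs equals throughput times latency. The Python sketch below applies it as a back-of-the-envelope estimate. The latency and IOPS figures are illustrative assumptions, and real behavior depends on the array, the fabric and the workload mix.

```python
import math


def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS when at most `queue_depth` IOs may be outstanding
    and each IO takes `latency_ms` on average (Little's law)."""
    if queue_depth <= 0 or latency_ms <= 0:
        raise ValueError("queue_depth and latency_ms must be positive")
    return queue_depth * 1000.0 / latency_ms


def required_queue_depth(target_iops: float, latency_ms: float) -> int:
    """Outstanding IOs needed to sustain `target_iops` at the given average latency."""
    if target_iops < 0 or latency_ms <= 0:
        raise ValueError("target_iops must be non-negative and latency_ms positive")
    return math.ceil(target_iops * latency_ms / 1000.0)


# Assumed figures: a queue limit of 32 and an average service time of 5 ms.
print(max_iops(32, 5.0))                # ceiling on IOPS behind one such queue
print(required_queue_depth(10000, 5.0)) # outstanding IOs a 10,000 IOPS workload would need
```

If the required depth comes out above the configured limit, the extra IOs wait in the guest or the hypervisor, and that waiting shows up as added latency rather than as a visible error.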

This converts flow control from an automated, adaptive infrastructure protocol function into a management function to be handled by virtualization administrators. Furthermore, channel flow control protocols provide end-to-end adaptive traffic control, coordinating flows along intermediate links to avoid bottlenecks. The hypervisor links do not extend this end-to-end control, increasing the possibility of uncoordinated flows and bottleneck formation.

Virtualization administrators are thus required to monitor IO traffic flows to detect, analyze and handle disruptions and coordinate these with storage and application administrators.  Now, VMware provides rich instrumentation to support this monitoring (see this article about storage analysis and monitoring and this article about vscsi stats). However, the tasks of monitoring this data, analyzing it, detecting IO disruptions and resolving them can be very challenging and require intimate understanding of storage IO flows and the underlying infrastructure’s operations.

Could one restore flow control over the hypervisor’s IO links to automated protocols?  Recent extensions of the FC standards (discussed below) permit overlays of virtual FC between VMs and the array. These extensions permit channel protocols to protect end-to-end flows from the vHBA to the LUN. Indeed, vSphere supports such channels. However, this requires use of raw storage access provided by RDM. In turn, one cannot use the storage semantics and rich services of VMFS.

Are there other alternatives to restore flow control to automated mechanisms, while preserving rich hypervisor services as provided by VMFS?  This question will be considered in future blogs.

We now turn to the second and more challenging problem of virtualized IO pipes.

IO Flows Can Disrupt Each Other

Consider the IO flows of multiple VMs sharing VMFS and HBA, as depicted in figure 2.  Suppose these VMs access different targets and retrieve large amounts of data to process. The storage arrays may inject these independent IO flows into the fabric over several ports and storage processors. The aggregate throughput of these flows may far exceed the capacity of the fabric port attached to the HBA. This will result in buffer overflows and loss, triggering retransmissions and increased latency.

A physical infrastructure, as depicted in figure 1, avoids such problems by dedicating physical capacity and carefully tuning IO  workloads to this capacity. In contrast, the application administrators of VMs cannot be aware of the IO workloads of other VMs, sharing the physical capacity with them.  For example, a database application may retrieve large tables to compute their join, while a security application may scan a VM storage for viruses. Interference presents a complex challenge when multiple IO-intensive applications share an HBA.

Why is interference in sharing an HBA harder to handle than for CPU sharing? CPU resources are carefully scheduled by automated hypervisor mechanisms, adapt to instantaneous traffic demands and provide guaranteed allocations.  In contrast, HBA resources are scheduled through loose mechanisms managed by administrators, do not adapt to instantaneous traffic and do not provide guaranteed allocations.

Interference and disruptions can emerge through competitive sharing of memory resources, not just HBA. Consider a guest database server requiring physical memory to process large tables. The hypervisor may use ballooning to reclaim physical memory from other VMs and expand its physical memory pool. Now suppose the VMs releasing this physical memory require it back. The memory available to the database server will decline. The hypervisor may swap least-recently-used (LRU) pages of the database server to its swap area. The guest OS of the database server may, too, use an LRU algorithm to swap the same pages to its own swap area. This requires the hypervisor to swap the pages back to physical memory where they may be copied by the guest OS to its swap area. Such interleaved swapping and ballooning can significantly disrupt multiple VMs. The database server, in particular, may be unable to handle the bursts of IO  flows delivering the large tables.

One could, of course, pursue several measures to limit interference. For example, one may reduce interference over VMFS by dedicating VMFS to IO-intensive applications (this, however, may create interference through competition of VMFS over memory resources). Similarly, one may limit IO throughput so that aggregate utilization does not exceed 30% of the HBA capacity. However, IO traffic is very bursty; even if traffic averages meet such pre-set limits, one cannot ignore the disruptive effects of bursts.
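The gap between averages and bursts is easy to see with a few numbers. The Python sketch below uses made-up per-interval demand samples and an assumed HBA capacity of 400 MB/s. It only illustrates how a workload can sit comfortably under a 30% average limit while individual intervals still ask for more than the HBA can carry.

```python
def burst_report(demand_mbps, capacity_mbps: float, avg_limit: float = 0.30) -> dict:
    """Summarize offered IO load against HBA capacity.

    `demand_mbps` holds the offered load per sample interval; values above
    capacity mean the burst asked for more than the HBA could deliver.
    """
    if not demand_mbps or capacity_mbps <= 0:
        raise ValueError("need at least one sample and a positive capacity")
    mean_util = sum(demand_mbps) / len(demand_mbps) / capacity_mbps
    peak_util = max(demand_mbps) / capacity_mbps
    return {
        "mean_util": mean_util,
        "peak_util": peak_util,
        "within_avg_limit": mean_util <= avg_limit,
        "saturated_samples": sum(1 for d in demand_mbps if d >= capacity_mbps),
    }


# Hypothetical demand samples in MB/s: mostly quiet, with two short bursts.
demand = [20, 30, 25, 400, 20, 25, 420, 20, 20, 20]
print(burst_report(demand, capacity_mbps=400.0))
```

In this example the average utilization is 25%, inside the limit, yet two of the ten intervals saturate the HBA. Shorter sample intervals tend to reveal more of these bursts, which is one reason coarse monitoring can miss them.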

Needless to say, one can reduce interference by limiting consolidation ratios for IO-intensive applications. Alternatively, one can over-engineer the IO pathways and memory to minimize interference.  However, both approaches put in question the very reason to virtualize IO-intensive applications. Another alternative, usually pursued by administrators in virtualizing IO-intensive applications, is to consolidate such applications with workloads involving low IO demands, e.g., consolidate a database server with print servers and web servers.

Recent efforts by the T11 standards committee (the NPIV protocol) provide a promising alternative in enabling FC channels to be virtualized and extended from the vHBA to the LUN. This permits FC protocols to allocate end-to-end resources to these virtualized channels, control flows and minimize interference.  These mechanisms can dramatically simplify both the interference and flow control problems. However, there are two limiting factors in using them. First, one has to use RDM to support such virtualized FC channels and abandon the rich  services offered by VMFS. Second, in an environment where VMs can move, the resources allocated to virtualized channels will need to be adapted dynamically to handle redistribution of the IO workloads; this requires challenging management automation tools.

In conclusion, virtualization of IO flows, while seemingly involving trivial changes from physical infrastructures, introduces significant new complexities and potential disruptions of IO flows. Part II considers some additional such challenges of IO virtualization and possible directions to resolve them.

Categories: Performance

Should you pursue a VMware performance PhD?

October 9th, 2009 vmturbo Comments

A recent article by David Vellante claims:

The fact is, most data center managers wouldn’t trust VMware to manage their Tier 1 applications because if something goes wrong performance-wise, you still need to roll in the VMware PhDs to solve it.

While such a statement can be controversial, it is difficult to ignore its valuable substance:

  1. Virtualization leads to novel complex performance problems.
  2. Managing these performance problems can be very challenging.
  3. This hinders virtualization of Tier 1 applications which can be very sensitive to performance problems.

In what follows we consider the first two claims.

A VM exports to its guest OS and applications the semantics of the underlying physical resources, but not the performance guarantees they provide. Indeed, an increase in consolidation ratios and utilization of the physical resources, necessarily means an increase in competition among workloads over these resources. This competition, in turn, can breed complex interference patterns and performance problems.

Consider a sample problem scenario. An application administrator asks you to increase the CPU budget allocated to their VM to handle its growing workloads. You double the VM allocation from 2 vCPUs to 4. Surprisingly, the performance of the application degrades rather than improves.

You face a few challenging questions:

  1. What could be the causes for this performance paradox?
  2. What instrumentation should you monitor to analyze the root causes?
  3. How do you resolve the problem?

VMware provides helpful documentation to handle these challenges.  Guides to performance monitoring and troubleshooting describe CPU problems and can help you address the first question. There are also articles that discuss performance monitoring counters, esxtop metrics, and their diagnostic meaning. For example, you may see an excessive value of the %RDY counter, describing “percent time spent by a VM waiting for CPU(s) to become available”.

Now, why would the VM wait for CPUs for so long? This indicates competition with other VMs. But shouldn’t a 4-vCPU configuration win a larger competitive share than 2 vCPUs? The answer to this question is provided in this article about SMP coscheduling mechanisms. These mechanisms seek to provide the VM the semantics of 4 vCPUs. However, when CPU resources are under tighter competition, the waiting periods for 4 vCPUs to become available are longer, as reflected by %RDY. Once this root cause has been determined, problem resolution is straightforward (e.g., free CPUs by shifting VMs to other resources).
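A deliberately crude probability model shows why waiting grows with the vCPU count. The Python sketch below assumes strict coscheduling (all vCPUs must be placed at the same instant) and assumes each physical core is busy independently with the same probability. Real hypervisor schedulers are more sophisticated than this, so treat the numbers as an illustration of the trend only; the function names are hypothetical.

```python
from math import comb


def prob_at_least_k_idle(cores: int, busy_prob: float, k: int) -> float:
    """Probability that at least k of `cores` physical cores are idle at a given
    instant, assuming each core is busy independently with `busy_prob`."""
    if not 0.0 <= busy_prob <= 1.0 or not 0 <= k <= cores:
        raise ValueError("need 0 <= busy_prob <= 1 and 0 <= k <= cores")
    idle = 1.0 - busy_prob
    return sum(
        comb(cores, i) * idle**i * busy_prob ** (cores - i)
        for i in range(k, cores + 1)
    )


def expected_scheduling_attempts(p_success: float) -> float:
    """Mean number of scheduling opportunities until one succeeds (geometric model)."""
    if p_success <= 0.0:
        raise ValueError("p_success must be positive")
    return 1.0 / p_success


# Assumed host: 8 cores, each busy 70% of the time with other VMs.
for vcpus in (2, 4):
    p = prob_at_least_k_idle(cores=8, busy_prob=0.7, k=vcpus)
    print(f"{vcpus} vCPUs: chance of a slot {p:.2f}, "
          f"about {expected_scheduling_attempts(p):.1f} attempts on average")
```

Under these assumptions the 4-vCPU VM finds enough idle cores far less often than the 2-vCPU VM, so it spends more time waiting even though it was given more virtual processors.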

This process is perhaps what David Vellante meant by “roll in the VMware PhDs”. Indeed, it requires intimate familiarity with the hypervisor mechanisms (e.g., coscheduling); understanding the meaning of performance instrumentation counters and their relationships to underlying performance behaviors; correlating the observed symptoms; analyzing the root cause; and handling it. Furthermore, some of these activities require tight collaboration between the virtualization administrators, application administrators and, possibly, the storage administrators.

To be fair, similar difficulties confronted the management of other emerging technologies. For example, in the early 90’s vendors of routers and LAN switches equipped them with management information bases (MIBs) involving thousands of cryptic counters, not just a few scores. Fortunately, a burgeoning management industry has grown tools that quickly relieved network administrators of the need to earn network PhDs. These tools have enabled administrators to monitor behaviors in terms of manageable abstractions, rather than cryptic instrumentation; automate the analysis of the instrumentation data through smart algorithms; and simplify and streamline management actions.

The virtualization industry, likewise, needs to replace PhDs in VMware with automation and simplification tools that focus on smart analysis and decisions, rather than tracking counter values. Indeed, such simplification and automation of management is a pre-condition to empowering virtualization to offer OPEX scalability, much as it has been offering CAPEX scalability. I will consider these possibilities in future posts.
