VMTurbo hits the road for CloudConnect

February 2nd, 2010 John Gannon Comments

Our CEO Shmuel Kliger will be a panelist at CloudConnect (March 15-18 in Santa Clara) where he’ll be participating in a discussion on Cloud Economics and new pricing models, moderated by Cloud Economics guru and AT&T VP Joe Weinman.

Here’s the description of the panel from the CloudConnect site:

Cloud ROI just entered a new phase with the introduction of spot pricing.  But wait; there’s more!  This panel of leaders will address cloud pricing and market system evolution and what it will mean for ROI.  Buyer’s councils, intercloud economics, new cloud resource allocation systems that use bidding and auctions to determine resources to invoke will all be part of this emerging cloudscape.

If you’re planning on attending, stop by and say hello.

Shmuel would love to chat with you, especially if you are a long-suffering Jets or Mets fan! :)



Three secrets of the Uber-Virtualized

January 28th, 2010 John Gannon Comments

In October 2009, Gartner estimated that only 16% of workloads worldwide are running in virtual machines, although tremendous growth is expected in the coming years.  Not surprisingly and roughly in line with Gartner’s estimates, most customers tend to be about 20% or 30% virtualized, with ambitious plans for growth in the coming year.

However, some organizations are outliers.  They have virtualized the majority of their IT environment and are seeing benefits above and beyond the typical server consolidation and disaster recovery use cases.

I like to call these folks the uber-virtualized, and in this post I’ll discuss some of the best practices we’ve learned from them!

1)  Pay attention to your storage environment, because there is a good chance it’s where the bottleneck lives!

The fear of storage bottlenecks keeps the uber-virtualized up at night.  When you’ve virtualized most of your IT environment, that is going to cause additional stress on your SAN because of all the virtual disks you’re storing there.  Rather than throwing storage capacity at the problem (at additional cost), much time and effort goes into poring over storage array and VirtualCenter data, trying to find optimization opportunities.

There are certainly some monitoring tools on the market which can aid in this process by gathering numerous bits of utilization and performance data from hosts and storage arrays.  However, these tools (just like VirtualCenter) leave the administrator to make the final decision about how to rebalance the environment and mitigate the risk of storage bottlenecks.  Fortunately, the uber-virtualized have been working with VMware technology for many years, and are often able to make the right decision based on their experience.

VMware has also recognized that storage challenges can really hurt virtualized IT deployments, and has responded by developing new technology like IO DRS.  This is a good first step, although the VMware administrator will need the experience to recognize the proper thresholds to configure to trigger a migration.  With 10 or even 100 VMs, this is fairly simple to do.  However, with hundreds or thousands of VMs in an environment that’s tightly managed (50%+ utilization), deeper analysis needs to be done, looking at all resources together (CPU, memory, and I/O), before making workload balancing decisions.  Otherwise you’re risking performance problems and downtime.

2) Are you (CPU) Ready?

CPU Ready is one of the key parameters that the most experienced VMware administrators examine when they see a performance problem.  In fact, it is often the first thing they’ll check when debugging.  Learn to love this statistic and what it means, because it can help you identify virtual machines that may be oversized and that cause your environment to perform poorly.

(By the way, here is a nice PowerShell script that will grab CPU Ready stats for all of your VMs!)
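If you pull CPU Ready out of the performance charts, you will usually get a summation value in milliseconds rather than a percentage. Here is a minimal Python sketch of the commonly documented conversion (ready milliseconds divided by the length of the sampling interval). The interval names, the function name, and the per-vCPU normalization are my own illustrative assumptions, not part of the script linked above.

```python
# Sampling interval lengths in seconds for the usual chart views.
# The dictionary keys are hypothetical labels chosen for this sketch.
INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def cpu_ready_percent(ready_ms, interval="realtime", vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    Dividing by the vCPU count gives an average per-vCPU figure, which is
    an assumption that makes VMs of different sizes comparable.
    """
    seconds = INTERVAL_SECONDS[interval]
    return ready_ms * 100.0 / (seconds * 1000.0 * vcpus)


if __name__ == "__main__":
    # A single-vCPU VM that waited 1000 ms during a 20 s real-time sample.
    print(cpu_ready_percent(1000))
    # A 2-vCPU VM that accumulated 6000 ms over a 5 minute sample.
    print(cpu_ready_percent(6000, interval="past_day", vcpus=2))
```

Whatever threshold you pick for "too much" ready time is a judgment call for your own environment; the point of the conversion is simply to make values from different chart intervals comparable.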

3) Automate lightly, young Padawan.

One thing that surprised me in talking to the uber-virtualized is that some are skeptical about the use of automation like DRS in their environments.   I thought everyone would be running DRS in fully automated mode, and have their feet up on the couch at home while drinking a beer since their environment was managing itself :)   But there were some organizations that felt that DRS didn’t give them everything they needed (particularly in the IO department).  It will be interesting to see how IO DRS, once released, will address some of these concerns.

Are you one of the uber-virtualized?  Care to share any of your secrets, tips, or tricks?  Please feel free to leave them in the comments; your fellow VMware administrators and architects will appreciate it!


Two common VMware CPU performance problems and solutions

January 11th, 2010 John Gannon Comments

This post is the first in a series of posts about identifying and solving VMware related performance problems.  In this post, we’ll briefly describe a couple of common VMware CPU performance problems and their solutions.  These are problems and solutions we’ve heard repeatedly from customers and end users.

Problem: Co-scheduling CPU fragmentation
Description: vSMP virtual machines suffer long delays and throughput degradation
Symptoms: Excessive ready counter; vSMP performance metrics down
Resolution: Reduce CPU load by moving VMs, or reconfigure overallocated vSMP virtual machines with fewer virtual processors

Problem: Interrupts
Description: A VM generates high interrupt rates, hogging the CPU
Symptoms: Long waits in the ready queue for a single vCPU
Resolution: Reduce CPU load by moving VMs, or reconfigure overallocated vSMP virtual machines with fewer virtual processors

Please let us know if you have your own problems, solutions, and best practices to add.

Categories: Performance, Resources Tags:

Can Theory Shed Light On Consolidation?(II)

January 6th, 2010 vmturbo Comments

(This is the second in the series of two articles about workload consolidation in virtual machine environments.  The first article is linked here.)

The Bottom Line

There are several practical applications one can draw from various mathematical and computer science theories related to capacity and workload management of virtual machine environments.

  1. Merging workloads can yield significant acceleration of processing during periods of high utilization.
  2. Aggregating processing capacity can accelerate processing at all traffic levels.
  3. Alternatively, consolidating workloads and processing capacity can enable higher utilization of resources while keeping processing delays within target QoS bounds.
  4. The higher the consolidation ratio, the higher these performance gains. Consolidation of workloads across a cluster yields higher gains than consolidation on a single host; and consolidation across a data center, or a cloud, yields higher gains than consolidation across a single cluster. It is thus best to consider consolidation as an expansive hierarchy of resource sharing schedulers.
  5. These performance gains require that the consolidated workloads be maximally uncorrelated.

Merging Workloads and Aggregating Processing

Consider the consolidation modes depicted in figure 1 below. Part (a) shows k workload streams processed by dedicated processors; the heavy yellow workload is queued for its service processor S1, even when the processors of the lighter green and blue workloads are idle. Part (b) shows workload merging into a single stream, sharing the pool of service processors. An arriving processing task may now access any of the processors. This eliminates scenarios where tasks are queued while some processors are idle, as in dedicated processing.

Figure 1:  Consolidation Modes

Part  (c) shows capacity aggregation of service processors  into a single processor with the aggregate capacity.  An arriving task can utilize the entire aggregate capacity of all the service processors to accelerate its processing.

Example: Consolidation of Network Traffic

To illustrate the two consolidation modes, workload merging (1b) and capacity aggregation (1c), consider the network workloads of k=10 VMs. The service processors S1,S2…Sn…S10 represent 1GE NICs at the host. The workloads represent packet streams generated by the VMs sharing the host. Workload merging, depicted in (1b), is provided by a virtual switch (vSwitch) that schedules the packets of all streams to the 10 NICs. This consolidation will produce multiplexing gains by permitting the streams to share the pool of processors and thus use their capacity more efficiently. Capacity aggregation, depicted in (1c), replaces the 10 1GE NICs with a single 10GE NIC. This consolidation will result in additional multiplexing gain, as packet processing times are significantly reduced by applying the full aggregate processing capacity.

Consolidation Creates Multiplexing Gains

Merging workloads, as in (1b),  clearly uses the service processors more efficiently than dedicating processors to individual streams, as in (1a). Capacity aggregation provides additional efficiency; an arriving task can use the full aggregate capacity of the k processors, not just one of them.

These multiplexing gains can be measured by the respective reductions in delays, as defined in part I:

GM =Tdedicated/Tmerged

where Tdedicated is the average processing delay  of the dedicated streams of (1a), while  Tmerged is the average delay of the workload merging consolidation(1b).  Similarly

GA =Tdedicated/Taggregated

where Taggregated is the average delay of the capacity aggregation consolidation (1c).

What are the  multiplexing gains GM and GA of the two consolidation modes?

We will now use elementary queueing theory to address this question and gain further insights into consolidation modes.

Brief Intro to Queueing Analysis of Workload Processing

Queueing theory describes a processing service in terms of three parameters: “A/B/k”, where A describes the statistical distribution of the job inter-arrival times (workload); B describes the distribution of service processing times; and k describes the number of servers. Additional assumptions may describe statistical independence of arrivals and service, the queueing discipline (e.g., FIFO, LIFO) and various service features.

For example, the M/M/1 queueing service model describes a single-server system handling a workload stream with statistically independent, exponentially distributed inter-arrival times and service times. The letter “M” indicates the memorylessness of the exponential distribution: the future is independent of the past. M/M/k models provide a standard base to explore the performance of workload processing services.

The exponential distribution is characterized by its rate. We use L (load) to indicate the arrival rate and C (capacity) to indicate the service rate. The utilization of the service is defined as U=L/C, the fraction of capacity used by the arriving workload. We note in passing that when one merges k independent streams with exponentially distributed inter-arrival times at rate L, the merged stream also has exponentially distributed inter-arrival times, at rate kL.
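The merging property is easy to check empirically. The following Python sketch, with hypothetical function names and arbitrarily chosen parameter values, generates k independent streams with exponential inter-arrival times at rate L, merges them, and compares the average gap of the merged stream with the predicted 1/(kL).

```python
import random


def merged_arrival_times(k, rate, horizon, rng):
    """Generate k independent Poisson streams on [0, horizon) and merge them."""
    times = []
    for _ in range(k):
        t = rng.expovariate(rate)  # first arrival of this stream
        while t < horizon:
            times.append(t)
            t += rng.expovariate(rate)  # next exponential inter-arrival gap
    times.sort()
    return times


def mean_gap(times):
    """Average inter-arrival time of a sorted list of arrival times."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    rng = random.Random(42)
    k, L = 10, 2.0  # illustrative values only
    times = merged_arrival_times(k, L, horizon=5000.0, rng=rng)
    print("measured mean gap :", mean_gap(times))
    print("predicted 1/(kL)  :", 1.0 / (k * L))
```

The two numbers should agree to within sampling noise. If the streams were correlated, this agreement would not be guaranteed, which is one reason the independence assumption matters.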

The performance of the three models, depicted in figure 1, may be analyzed using M/M/k analysis. Assume that the workload arrivals are exponentially distributed with rate L, while service times are exponentially distributed with rate C. Figure (1a) depicts k independent M/M/1 systems. Figure (1b) depicts a merging of these k systems into a single M/M/k system, with arrival rate of kL. Figure (1c) depicts an M/M/1 system with aggregate arrival rate of kL and aggregate service rate of kC.  Note that the utilization U=L/C=kL/kC is the same in all models. However, processing delays will vary greatly.

Queueing Analysis of  The Multiplexing Gains

We are now ready to analyze the multiplexing gains of the workload merging and capacity aggregation consolidation modes, depicted in figure 1.

The average delay of an M/M/1 stream, as in (1a),  is given by Tdedicated=1/(C-L).

The average delay of an M/M/k stream, as in (1b), is given by

Tmerged=(1/C)+P(k,U)/(kC-kL)    where P(k,U) is the probability of queueing and kL is the arrival rate of the merged stream

For U~0 (low utilization) P(k,U)~0 and thus T0merged~1/C; that is, the delay is approximately the service time. For U~1 (high utilization) P(k,U)~1 and the queueing term dominates, thus T1merged~1/(kC-kL)=1/(k(C-L)).

Therefore, when U~0 (light load): Gmerged=Tdedicated/T0merged=C/(C-L)=1/(1-U)~1; there is no gain in merging workloads to share processors. However, when U~1 (heavy load): Gmerged=Tdedicated/T1merged=(kC-kL)/(C-L)=k; the multiplexing gain approaches the consolidation ratio k.

To summarize, Gmerged~1 for light utilization, while for high utilization  Gmerged~k .
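To see how the merging gain moves between these two extremes, here is a small Python sketch. It assumes the standard M/M/k result in which P(k,U) is the Erlang C probability and the queueing term is P(k,U)/(kC-kL), with kL being the arrival rate of the merged stream. The function names and the sample values are illustrative choices.

```python
from math import factorial


def erlang_c(k, u):
    """Probability that an arriving job must queue in an M/M/k system.

    k is the number of servers and u the per-server utilization (0 <= u < 1).
    """
    a = k * u  # offered load in Erlangs
    tail = a ** k / factorial(k) / (1.0 - u)
    head = sum(a ** n / factorial(n) for n in range(k))
    return tail / (head + tail)


def merging_gain(k, L, C):
    """Gmerged = Tdedicated / Tmerged for k streams of rate L and k servers of rate C."""
    u = L / C
    t_dedicated = 1.0 / (C - L)
    t_merged = 1.0 / C + erlang_c(k, u) / (k * C - k * L)
    return t_dedicated / t_merged


if __name__ == "__main__":
    k, C = 10, 1.0  # illustrative values only
    for u in (0.1, 0.5, 0.9, 0.99):
        print(f"U={u:<4}  Gmerged={merging_gain(k, u * C, C):.2f}")
```

With k=1 the function returns a gain of 1, as it should, since merging a single stream changes nothing. For larger k the gain climbs from roughly 1 toward k as U approaches 1.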

For example, if  network traffic of 10 VMs is merged by a vSwitch into 10 NICs, during heavy traffic the multiplexing gains will converge to the consolidation ratio 10. When traffic is light, gains will be minimal.

Consider now the aggregated stream of figure (1c). The average delay  is given by Taggregated=1/(kC-kL)=(1/k)Tdedicated

Therefore GA =Tdedicated/Taggregated =k

For example, suppose network traffic of 10 VMs is aggregated into a 10GE NIC . The traffic processing delays will be reduced by a factor of 10 over using 10 dedicated 1GE NICs.
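As a quick worked check with illustrative numbers (my own choice, not measured data), take C=1, L=0.5 (so U=0.5) and k=10:

```latex
\begin{align*}
T_{\text{dedicated}}  &= \frac{1}{C-L} = \frac{1}{1-0.5} = 2,\\
T_{\text{aggregated}} &= \frac{1}{kC-kL} = \frac{1}{10-5} = 0.2,\\
G_A &= \frac{T_{\text{dedicated}}}{T_{\text{aggregated}}} = \frac{2}{0.2} = 10 = k.
\end{align*}
```

Because the factor k cancels out of the ratio for any U<1, the gain does not depend on how busy the system is.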

Now compare the workload merging and capacity aggregation modes. Under light traffic GM ~1, that is, merging produces no gains over dedicated resources. In contrast, GA =k and thus, even under light traffic, processing is greatly accelerated. However, for high utilization GM ~GA ~k. That is, under heavy load merging provides gains similar to those of capacity aggregation.

Finally, one should keep in mind that theory is, at best, a valuable approximation of reality. The M/M/k analysis above is based on assumptions that may, or may not, be valid for specific real scenarios. For example, workloads may be correlated, violating the assumptions of M/M/k models. Therefore, it is best to view the theory as a quantitative underpinning of the qualitative performance behaviors and focus on these.

In summary, consolidation of k M/M/1 systems can reduce processing delays during periods of high utilization by a factor similar to the consolidation ratio.

Applications

We now apply the theoretical considerations above to several resource sharing scenarios in virtualization systems.

Memory Sharing

Memory pages may be considered as processors of workloads demands. If these pages are partitioned among VMs — i.e., through reservations — the resulting dedicated processing of memory requests is described by figure (1a).  Such memory reservations can result in great inefficiencies as some VMs queue their memory pages in swap areas, while other VMs are under-utilizing their memory shares. This can have dramatic impact on performance. Therefore, it is desirable to permit sharing of unused memory. Such sharing can be handled by allocating the pool of unused memory according to “shares” of respective VMs.

If k is the number of pages in the shared pool, the M/M/k model of figure (1b) provides a useful approximation of this merging of memory request streams. When utilization is high, one can expect access delays to improve by a factor of about k over dedicated reservations. Unfortunately, when utilization increases the size of the shared pool tends to decrease, and the gains in access speeds may be offset by swapping overheads. Therefore, it is useful to maximize the size of the shared pool to reap the benefits of consolidation. The ballooning mechanism of VMware is an example of how the virtualization system can “steal” unused dedicated memory to increase the shared pool.

Storage IO  Sharing in SANs

Storage IO streams share the processing capacity of an HBA and SAN fabric behind it. Storage protocols partition the HBA capacity among IO channels to the storage array. Such partition of IO processing exhibits the performance of  dedicated streams as in figure (1a).

Virtualization systems, however, often merge the IO streams of multiple VMs into a single IO channel. If all VMs’ IO traffic is consolidated into a single IO channel, the resulting stream can exhibit the performance gains of the capacity aggregation mode of figure (1c).  On the surface, such consolidation would maximize the multiplexing gains.  Unfortunately, consolidation of storage IO can lead to inconsistencies with the storage array mechanisms. For example, merging streams of storage IO operations may disrupt sequential access, leading to a significant slowdown of storage access.

Therefore, consolidation should not be considered a panacea. The benefits of multiplexing gains may be offset by limitations of other mechanisms. In the case of storage IO, the root obstacle to consolidation is merging IO streams into a single IO channel. The semantics of a “channel” expected by a storage array are inconsistent with merged streams.  The problem can be solved by maintaining a separate channel identifier for each stream (virtual channel), yet sharing the underlying physical processing capacity among them. This permits the storage array to optimize the processing of each virtual channel, while benefiting from the multiplexing gains in using underlying processing resources.

Maximizing Utilization vs. Delay

The discussion above focused on multiplexing gains in reducing delays. More often, one is interested in using consolidation to maximize utilization while keeping delays within a given target range.

This, however, can be resolved by a simple twist of the analysis above. Consider for example the capacity aggregation of figure (1c). Suppose one merges the k workload streams but aggregates the capacity of only m<k processors. The utilization of the consolidated system is U*=kL/mC=(k/m)U. Thus, the utilization gain is U*/U=k/m.

The average delay contracts from T=1/(C-L) to T*=1/(mC-kL). So G=T/T*=(mC-kL)/(C-L)=(m-kU)/(1-U).  Suppose one wishes to double the average utilization by using m=k/2. Then G=k(1/2-U)/(1-U). When U is sufficiently small, G~k/2; so utilization can be doubled, while keeping significant multiplexing gains proportional to half the consolidation ratio.
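The trade-off is easy to tabulate. The Python sketch below evaluates the formulas above, U*=(k/m)U and G=(m-kU)/(1-U), for a given k, m and U. The function name and the sample values are illustrative, and the stability check (U* must stay below 1) is a condition I am adding explicitly because the delay formula only holds for a stable queue.

```python
def consolidation_tradeoff(k, m, u):
    """Return (utilization gain, consolidated utilization, delay gain).

    k workload streams, each with dedicated utilization u, are merged onto
    the aggregated capacity of m processors (m <= k).
    """
    new_u = (k / m) * u
    if new_u >= 1.0:
        raise ValueError("consolidated utilization must stay below 1")
    delay_gain = (m - k * u) / (1.0 - u)
    return k / m, new_u, delay_gain


if __name__ == "__main__":
    k, m = 10, 5  # double the utilization by using m = k/2
    for u in (0.05, 0.1, 0.3, 0.4, 0.45):
        gain_u, new_u, gain_t = consolidation_tradeoff(k, m, u)
        print(f"U={u:<5} U*={new_u:.2f}  utilization x{gain_u:.0f}  delay gain G={gain_t:.2f}")
```

Note how the delay gain shrinks as U approaches 1/2 and eventually drops below 1, at which point the consolidated system is slower than the dedicated one. That is why the result is stated for sufficiently small U.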


5 tips to help you ride the next wave of server virtualization


After spending most of the last 6 years working in the virtualization space as a vendor (now VMTurbo and previously VMware), it’s funny to see how much things change as well as how much they stay the same.

One thing that has definitely not changed is that it is still very hard to move from what I like to call the ‘1st wave’ of virtualization (test & dev systems or low criticality production systems) to the ‘next wave’ (e.g. business critical production systems, heavily utilized databases).

On that note, I wanted to share a few practical tips that I’ve picked up along the way that have helped customers and partners keep the virtualization momentum going – and can help keep your virtualization momentum going in 2010!

1.  Use a disaster recovery or business continuity project to spur additional virtualization and consolidation.

Most companies have challenges around meeting disaster recovery and business continuity goals.  DR in the physical server world is tedious, error-prone, and in my experience mostly ineffective.  If there is a DR initiative at your company, it is a good bet that some of the problems you are trying to solve could be addressed by virtualizing those systems which  don’t have DR capability today or that have been problematic to recover using traditional physical server techniques.  Another tactic that I really like and have seen a few times is using your test and dev environment for DR.  Most virtual server environments I see still have plenty of capacity with which to handle a burst in the event a DR scenario occurred, so having a hybrid test/dev/DR environment is a great way to leverage an investment you’ve already made.

2.  Use a hardware refresh as an opportunity to virtualize.

Most IT shops refresh their server hardware every few years.  Why not use the refresh as an opportunity to remove hardware from your datacenter while adding flexibility to your operation?  Some of these systems may represent some of the more challenging applications to virtualize, and you may receive some resistance from application owners who are new to virtualization, but the CAPEX (and potentially OPEX) savings will be hard to ignore.

3.  Educate your peers.

Many companies do ‘lunch and learns’ or other informal gatherings where the virtualization team leads will discuss how server virtualization works.  These gatherings are a great way to get your network, storage, and applications guys up to speed with your specific initiatives and virtualization technology, and get them talking and asking questions.  This education and relationship building will pay dividends when you start to move more critical applications into virtual machines and need to work closely with other groups within IT on capacity planning and troubleshooting.  Just ask the network guys; they’ve been getting blamed for years for problems that aren’t theirs!  Fortunately for them (and sadly for the virtualization administrator), the new whipping boy is the virtualization environment, and educating your peers can help mitigate this challenge.

4.   Connect with others in your city or industry who have successfully made it to the ‘next wave’ and gather best practices.

Certainly the web and social networking give us a great way to connect with virtualization experts, but there is still no substitute for face-to-face discussions or phone calls where you can ask questions directly to someone who has done it before.  If you know of another company in your industry or city who have already made it to the next wave of virtualization, and have learned the lessons (good and bad) along the way, reach out to them and see if they’d be open to a discussion.  I’d also recommend including when possible any key peers or managers in these calls and meetings.  This way, they have the opportunity to ask questions as well as internalize the information.

5. Measure and then publicize your success.

Don’t be afraid to let people in your organization know that you’ve saved money, increased responsiveness of IT to the business, and built a strategic, virtualized platform!  Keep an eye on your ‘before’ and ‘after’ metrics, and share them with management as well as folks on the business side.  Your results help build the confidence within your organization that you have a good handle on building and operating a virtualized environment, and are fully capable of onboarding additional applications and business units.

What did I miss?  Are there other techniques that have worked for you?  Please share them in the comments.


Top VMTurbo Blog Posts of 2009

December 30th, 2009 John Gannon Comments

Granted, we  have another day or so until 2010, but I thought it was still a good time to list the  most popular VMTurbo blog posts in 2009.

Here we go…

  1. Bridging the Virtualization Management Gaps by our CEO Shmuel Kliger
  2. Where is the virtual I/O bottleneck?
  3. Can theory shed light on workload consolidation?

Thanks to everyone who has supported us in 2009 and our best wishes to you and your families in 2010!

Categories: VMTurbo Tags:

2010 IT Management Predictions

December 22nd, 2009 John Gannon Comments

It’s that time of year again!  Our friends at Aprigo surveyed various IT management experts (including our CEO, Shmuel Kliger) to come up with a set of 2010 IT Management Predictions.  Take a look (below) and feel free to share a prediction of your own.  We’d love to hear your thoughts.

Categories: VMTurbo Tags:

Cloud computing economics on my mind

December 14th, 2009 John Gannon Comments

Over the last few days, there have been a couple of interesting developments and posts from the cloud community that have me thinking about the economic ramifications of cloud computing.

Amazon just announced spot pricing for EC2 instances:  Amazon users can now name their price for an EC2 instance and if capacity is available at that price, the instance will be purchased.  Werner Vogels (Amazon CTO) discusses this in more detail on his blog.

Hedging Your Options for the Cloud:  In this GigaOM post, Joe Weinman of AT&T discusses a variety of economic and business models for cloud computing.

Taking pricing models used in other industries (e.g. airlines, hotels, manufacturing, internet advertising) and applying them to the management of clouds makes sense, especially as compute power and application capacity become commoditized.  However, there is still much work to be done to achieve the same level of pricing and service granularity in the application world.

For instance, a unit of cloud CPU power or cloud storage is somewhat fungible, and it is fairly clear what you’re purchasing for your money.  On the other hand, a unit of specific application capacity running within an enterprise is much harder to value, sell, or trade.

For these new pricing models to have the most impact and business value to end users of IT, they’ll need to be viable in the cloud as well as within the four walls of the enterprise.


The Data Center of the Future vs. The Future of the Data Center

December 9th, 2009 John Gannon Comments

A good presentation on the future of the datacenter by Richard Scannell, founder of the IT services firm Glasshouse.  Food for thought!


Bridging the Virtualization Management Gaps

December 7th, 2009 shmuel Comments

“It’s midnight: do you know where your applications are?”

It used to be somewhat easy. The world was static with relatively few moving parts. A single application on a single OS on a single server with attached storage. With a few (sometimes more than a few) pointed niche management tools we were able to get our hands around and manage our environments. Well, virtualization changes everything.  No more static boundaries and well-defined interactions between the IT silos.

Do you know where your applications are?  Do you know where your virtual machines are?  Do you know what resources they are using?  Do you know how they perform?  Do they need more or less resources to deliver on their goals?   Are there bottlenecks in your environment? Where are the bottlenecks?

More important! Do you know what you need to DO now? In the next minute? Hour? Day? Week? Month? Do you need to start a new VM? Stop a VM? Move a VM? Do you know where to start/move the VM? Do you need to reconfigure any of its resources?  Do you need to provide more resources? What do you need to do to address the bottlenecks? How do you prevent them?

Are the management tools up to the tasks at hand? The answer is NO!

Virtualization brings down the walls between the silos, but management is lagging far behind and continuing on the trajectory to nowhere of the last decades. There are five fundamental gaps in today’s management environment that we need to bridge before we can address the challenges and opportunities introduced by virtualization:

  • Business Gap – IT management is not aligned with the supported business and is not governed by business-driven goals and policies.
  • Management Information Gap – IT is focused on collecting too much data about the infrastructure that has to be deciphered, creating a huge gap between the raw data that is collected and the meaningful, actionable information required for intelligent, cost-effective operation.
  • Technology Gap – Heterogeneous environments are made up of a wide variety of technologies, spanning networks, servers, storage, and all the way up to the applications. All are parts of the IT stack participating in delivering business service, yet they are being managed in silos. Furthermore, within each layer of the stack, each technology and each product is managed separately.
  • Operational Gap – Operational disciplines (fault, performance, planning, configuration, provisioning, accounting/billing, security) are all handled separately.
  • Automation Gap – To reduce complexity and operational costs, we strive for self-managed environments, i.e., environments that configure themselves, optimize themselves, secure themselves and practically heal themselves. Yet, these self-managed automation functions are tightly interacting with the managed entities, dealing with the detailed mess of the different entities in the environment. There is no decomposition of services and no decoupling of functional layers. We end up with one big monolithic mess of interactions.

A proper foundation for bridging the gaps must have three tenets, Abstraction, Analysis, and Automation:

  • Abstraction – a layer of abstraction provides a model of the managed environments. The models hide the details of the environments by providing a common abstraction of the heterogeneous environments, exposing a common, rich, semantic interface for interacting with the environments.
  • Analysis – analysis engines driven by the knowledge captured and represented by the abstraction layer to drive intelligent decision automation.
  • Automation – expose the foundations as collections of services orchestrated using workflow engines driven by business policies.

Utilizing these three tenets provides the proper foundation for virtualization management solutions that will:

  • Provide intelligent actionable information with minimal monitoring information, reversing the current trend of chasing too much information and trying to discover, represent and rely on continuously changing topological relationships.
  • Scale to the increasing sizes of future environments, meeting the challenges of dynamically changing heterogeneous environment utilizing common abstractions.
  • And above all, tie the viewing with the doing utilizing intelligent analysis to drive automation and reduce operational costs!
Categories: Management Tags: