Archive

Archive for the ‘Theory’ Category

Can Theory Shed Light On Consolidation?(II)

January 6th, 2010 vmturbo Comments

(This is the second in the series of two articles about workload consolidation in virtual machine environments.  The first article is linked here.)

The Bottom Line

There are several practical applications one can draw from various mathematical and computer science theories related to capacity and workload management of virtual machine environments.

  1. Merging workloads can yield significant acceleration of processing during periods of high utilization.
  2. Aggregating processing capacity can accelerate processing at all traffic levels.
  3. Alternatively, consolidating workloads and processing capacity can enable higher utilization of resources while keeping processing delays within target QoS bounds.
  4. These performance gains are higher, the higher the consolidation ratio. Consolidation of workloads across  a cluster yields higher gains than consolidation of a single host; and consolidation across a data center, or a cloud, yields higher gains than a single cluster. It is thus best to consider consolidation as an expansive hierarchy of resource sharing schedulers.
  5. These performance gains require that the consolidated workloads be maximally uncorrelated.

Merging Workloads and Aggregating Processing

Consider  the consolidation modes depicted in figure 1 below. Part (a) shows  k workload streams processed by dedicated processors; the heavy yellow workloads is queued for its service processor S1, even when the green and blue processors (S1, Sk) are not used by their respective lighter workloads. Part (b) shows workload merging into a single stream, sharing the pool of service processors.  An arriving processing task may now access any of the processors. This  eliminates scenarios   where tasks are queued while some processors are idle, as in dedicated processing.   pix2.1

Figure 1:  Consolidation Modes

Part  (c) shows capacity aggregation of service processors  into a single processor with the aggregate capacity.  An arriving task can utilize the entire aggregate capacity of all the service processors to accelerate its processing.

Example: Consolidation of Network Traffic

To illustrate the two consolidation modes, workload merging (1b) and capacity aggregation (1c), consider the network workloads of k=10 VMs. The service processors S1,S2…Sn…S10 represent 1GE NIC at the host. The workloads represent packet streams generated by the VMs sharing the hosts.  Workloads merging, depicted in (1b), is provided by virtual switch (vSwitch) that schedules packets, of all streams,  to the  10 NICs . This consolidation will produce multiplexing gains by permitting the streams to share the pool of processors and thus use their capacity more efficiently. The capacity aggregation, depicted in (1c), replaces the 10 1GE NICs with a single 10GE NIC. This consolidation will result in additional multiplexing gain as packets processing times are significantly reduced by applying the full aggregate processing capacity.

Consolidation Creates Multiplexing Gains

Merging workloads, as in (1b),  clearly uses the service processors more efficiently than dedicating processors to individual streams, as in (1a). Capacity aggregation provides additional efficiency; an arriving task can use the full aggregate capacity of the k processors, not just one of them.

These multiplexing gains can be measured by the respective reductions in delays, as defined in part I:

GM =Tdedicated/Tmerged

where Tdedicated is the average processing delay  of the dedicated streams of (1a), while  Tmerged is the average delay of the workload merging consolidation(1b).  Similarly

GA =Tdedicated/Taggregated

where Taggregated is the average delay of the capacity aggregation consolidation (1c).

What are the  multiplexing gains GM and GA of the two consolidation modes?

We will now use elementary queueing theory to address this question and gain further insights into consolidation modes.

Brief Intro to Queueing Analysis of Workload Processing

Queueing theory describes a processing service in terms of three parameters: “A/B/k”  where A describes the statistical distribution of the jobs inter-arrival times (workload); B described the distribution of service processing times; and k describes the number of servers. Additional assumptions may describe statistical independence of arrivals and service, the queueing discipline  (e.g., FIFO, LIFO ) and various service features.

For example, the  M/M/1 queueing service model describes a single server system handling a workload stream with statistically independent exponentially distributed inter-arrival times and service times. The letter “M”  indicates the memorylessness  of the exponential distribution: the future is independent of its past. M/M/k models provide a standard base to explore the performance of a workload processing services.

The exponential distribution is characterized by its rate. We use  L (load) to indicate the arrival rate and C (capacity) to indicate the service rate. The utilization of the service is defined as U=L/C,  the fraction of capacity used by arriving workload. We note in passing that when one merges k streams with exponentially distributed arrivals at rate L, the merged stream also has exponentially distributed arrivals at rate kL.

The performance of the three models, depicted in figure 1, may be analyzed using M/M/k  analysis. Assume that the workload arrivals  are  exponentially distributed with rate L; while service times are exponentially distributed with rate C. Figure (1a) depicts k independent M/M/1 systems. Figure (1b) depicts a merging of these k systems into a single M/M/k system, with arrival rate of kL. Figure (1c) depicts an M/M/1 system with aggregate arrival rate of kL and aggregate service rate of kC.  Note that the utilization U=L/C=kL/kC remains similar in all models. However, processing delays will vary greatly.

Queueing Analysis of  The Multiplexing Gains

We are now ready to analyze the multiplexing gains of the workload merging and capacity aggregation consolidation modes, depicted in figure 1.

The average delay of an M/M/1 stream, as in (1a),  is given by Tdedicated=1/(C-L).

The average delay of an M/M/k stream, as in (1b), is given by

Tmerged=(1/C)+P(k,U)/(kC-L)    where P(k,U) is the probability of queueing

For U~0 (low utilization) P(k,U)~0 and thus T0merged~1/C ; that is, the waiting time is approximately the service time. For U~1 (high utilization) P(k,U)~1 and thus T1merged~1/(kC-L).

Therefore, when U~0 (light load): Gmerged=Tdedicated/T0merged= C/(C-L)=1/(1-U) ~1;  there is no gain in merging workloads to  share processors. However, when U~1 (heavy load):  Gmerged=Tdedicated/T1merged=(kC-L)/(C-L)=(k-U)/(1-U)~k; the multiplexing gain is proportional to the consolidation ratio k.

To summarize, Gmerged~1 for light utilization, while for high utilization  Gmerged~k .

For example, if  network traffic of 10 VMs is merged by a vSwitch into 10 NICs, during heavy traffic the multiplexing gains will converge to the consolidation ratio 10. When traffic is light, gains will be minimal.

Consider now the aggregated stream of figure (1c). The average delay  is given by Taggregated=1/(kC-kL)=(1/k)Tdedicated

Therefore GA =Tdedicated/Taggregated =k

For example, suppose network traffic of 10 VMs is aggregated into a 10GE NIC . The traffic processing delays will be reduced by a factor of 10 over using 10 dedicated 1GE NICs.

Now compare workload merging with capacity aggregation modes. Under light traffic  GM ~1, that is, merging produces no gains over dedicated resources. In contrast, GA =k and thus, even under light traffic, processing is greatly accelerated. However, for high utilization GM ~GA ~k. That is, both merging provides similar gains as capacity aggregation.

Finally, one should keep in mind that theory is, at best, a valuable approximation of reality. The M/M/k analysis above is based on assumptions that may, or may not, be valid for specific real scenarios. For example, workloads may be correlated, violating the assumptions of M/M/k models. Therefore, it is best to view the theory as a quantitative underpinning of the qualitative performance behaviors and focus on these.

In summary consolidation of k  M/M/1 systems can reduces processing delays through periods of high-utilization,  by a factor similar to the consolidation ratio.

Applications

We now apply the theoretical considerations above to several resource sharing scenarios in virtualization systems.

Memory Sharing

Memory pages may be considered as processors of workloads demands. If these pages are partitioned among VMs — i.e., through reservations — the resulting dedicated processing of memory requests is described by figure (1a).  Such memory reservations can result in great inefficiencies as some VMs queue their memory pages in swap areas, while other VMs are under-utilizing their memory shares. This can have dramatic impact on performance. Therefore, it is desirable to permit sharing of unused memory. Such sharing can be handled by allocating the pool of unused memory according to “shares” of respective VMs.

If   k is the number of pages in the shared pool, the M/M/k model of figure (1b) provides a useful  approximation of this merging of memory requests streams. When utilization is high, one can expect the multiplexing gains in access delays to improve by a factor of k, over dedicated reservations. Unfortunately, when utilization increases the size of the shared pool tend to decrease and the gains in access speeds may be offset by swapping overheads. Therefore, it is useful to max the size of the shared pool to rip the benefits of consolidation. The ballooning mechanisms of VMWare are an example how the virtualization system can “steal” unused dedicated memory to increase the shared pool.

Storage IO  Sharing in SANs

Storage IO streams share the processing capacity of an HBA and SAN fabric behind it. Storage protocols partition the HBA capacity among IO channels to the storage array. Such partition of IO processing exhibits the performance of  dedicated streams as in figure (1a).

Virtualization systems, however, often merge the IO streams of multiple VMs into a single IO channel. If all VMs IO traffic is consolidated into a single IO channel, the resulting stream can exhibit the performance gains of the capacity aggregation mode of  figure (1c).  On the surface such consolidation would max the multiplexing gains.  Unfortunately, consolidation of storage IO can lead to inconsistencies with the storage array mechanisms. For example, merging of storage IO operations streams may disrupt sequential access, leading to significant slow down of storage access.

Therefore, consolidation should not be considered panacea. The benefits of multiplexing gains may be offset by limitations of other mechanisms. In the case of storage IO, the root obstacle to consolidation is merging IO streams into a single IO channel. The semantic of a “channel” expected by a storage array is inconsistent with merged streams.  The problem can be solved by maintaining separate channel identifier for each stream (virtual channel), yet sharing the underlying physical processing capacity among them. This permits the storage array to optimize the processing of each virtual channel, while benefiting from the multiplexing gains in using underlying processing resources.

Maxing Utilization vs. Delay

The discussion above focused on multiplexing gains in reducing delays. More often, one is interested in using consolidation to max utilization while keeping delays within a given target range.

This, however, can be resolved by a simple twist of the analysis above. Consider for example the capacity aggregation of figure (1c). Suppose one merges the k workload streams but aggregates the capacity of only m<k processors. The utilization of the consolidated system is U*=kL/mC=(k/m)U. Thus, the utilization gain is U*/U=k/m.

The average delay contracts from T=1/(C-L) to T*=1/(mC-kL). So G=T/T*=(mC-kL)/(C-L)=(m-kU)/(1-U).  Suppose one wishes to double the average utilization by using m=k/2. Then G=k(1/2-U)/(1-U). When U is sufficiently small, G~k/2; so utilization can be doubled, while keeping significant multiplexing gains proportional to half the consolidation ratio.

Reblog this post [with Zemanta]
  • Share/Bookmark

Can Theory Shed Light on Workload Consolidation? (I)

November 19th, 2009 vmturbo Comments

This article is the first in a series devoted to the queueing-theoretic foundations of workload consolidation and its applications to virtualization.

Consider 3 physical machines, each dedicated to process the workload of a respective application, as depicted in figure 1 below. The workload streams arrive in bursts of processing requests, depicted by the rectangles on the arrival time-lines. The rectangles lengths depict the durations of bursts, while their heights depict the amount of processing  to service these bursts. These workload streams are queued until the service processors, depicted by ellipses, are available to process their requests. The service processors may be providing CPU, memory, storage I/O bandwidth, or network bandwidth.

dedicated

Figure 1:  Dedicated Workload Processing

Suppose the three applications are consolidated into virtual machines (VMs) executing on a single host.  Consolidation can generally involve two independent actions: (a) merging  the three workloads into a single stream; and (b) pooling the three service processors  into a single processor.  For example, the service processors may be 3 separate Network Interface Cards (NICs) which are pooled, through a virtual switch, into a single service processor handling the aggregate network traffic of the 3 applications. This aggregation of workloads and processors is depicted in the figure below.

ConsolidatedQ

Figure 2:  Consolidated Workload Processing

Consolidation permits workloads to share the full service processing capacity and thus create statistical multiplexing gains.  For example, the “yellow” workload of figure 1 is queued, waiting for its dedicated processor. At the same time, the processor dedicated to the orange workload  is idle. Dedicated processing does not permit the yellow workload to share the idle capacity dedicated to the orange workload. In contrast, the consolidated service of figure 2 permits any  processor to be used by any workload. The yellow workload can tap the aggregate processing power of all processors and accelerate its processing.

How much multiplexing gain can consolidation provide?

In general, let T denote the average delay of dedicated processing streams and let T* denote the average delay of their consolidated stream. The multiplexing gain in delay is defined as G=T/T*. How large can the delay gain G be?

The precise answer clearly depends on the statistical details of the system. Consider first an extreme scenario, depicted in figure 3 below. The yellow and green streams generate clustered bursts that produce long queues and delays. The orange stream is using its processor very lightly. The consolidated workload  streamlines traffic evenly and uses the collective power of the 3 processors to eliminate queues and delays; thus T*=0. The multiplexing gain G=T/T*=T/0 is infinite.

3muxgains,jpgFigure 3:  Multiplexing Gains Reduce Queueing Delays

Figure 4,  in contrast, depicts a scenario with no multiplexing gains. The 3  streams consist of fully synchronized burst arrivals; all three processors are either busy, or idle, at the same time.  The delay T* experienced by the consolidated stream is the same as the delay T of dedicated streams . The multiplexing gain in delay is G=T/T*=1. That is, there is no multiplexing gain.

4nomuxgains

Figure 4:  No Multiplexing Gains

More generally, multiplexing gains depend on the specific statistics of the workload. The example of figure 3 accomplished maximal gain because workload bursts were uncorrelated in time  and among the individual streams. The example of figure 4 produced no gains because workload bursts of individual streams were perfectly correlated.

Intuitively speaking, multiplexing gains obtain when the workloads of  the multiplexed streams are minimally correlated.

In what follows we will see that, when the workload statistics indeed minimizes correlations in time and among streams, the multiplexing gain can be proportional to the consolidation ratio.

It  is useful to first illustrate some of the practical implications of this rule.

Example 1: Consolidation of Network Traffic

Network traffic of VMs is typically aggregated by a virtual Switch (vSwitch).  The vSwitch schedules the merged traffic streams into respective physical NICs. Network traffic shares the aggregate capacity of the NICs.  Network traffic streams are typically minimally correlated. Therefore, aggregation of NICs and traffic typically yields significant multiplexing gains, over dedicated NICs.

Example 2: Load Balancing

Load balancing is  essentially a consolidation mechanism. The  load balancer creates a pool of the processing resources and schedules their allocation  to the aggregate stream. The load balancer incurs multiplexing gains to the extent that the workload streams are minimally correlated.

The size of the multiplexing gains can be proportional to the amount of resources pooled by the load balancer. For example,  the Distributed Resource Scheduler (DRS) of VMware,  consolidates the resources of a cluster and can thus  yield substantially higher multiplexing gains than consolidation of workloads into a single physical host. By the same token, a load balancer that pools the resources of an entire data-center, or a cloud, can yield substantially higher gains than balancing loads over  a single cluster.

Example 3: Database Applications Services

Consider a virtualized database server consolidated with its applications to share the same physical host. Suppose the applications read/write files based on data manipulations by the database servers. Thus, for example, the database may retrieve a large table into memory and then provide an application with large share of this data, which it uses to update large files it retrieves from storage.

These computational activities can create high degree of correlations among the respective workloads. For example, the application may generate a burst demand for memory in order to store the data provided by the server, just when the database server produces a burst demand for memory to retrieve sections of the table. Similarly, the application will generate a burst demand for storage IO, to read its large files, just when the database server is producing a burst demand for IO to retrieve the tables.

These correlations can reduce, even eliminate, multiplexing gains. Worse, as we will discuss in subsequent posts, the correlations may result in increasing the workload peak bursts saturating the respective resources.

Clearly, to avoid such contention for shared resources one needs to best assure that correlations are  minimized. For example, one may relocate database applications involving high-degree of I/O into a separate physical host.


We continue this discussion in Part II, where we provide quantitative analysis of these multiplexing gains and consider their additional applications.

Reblog this post [with Zemanta]
  • Share/Bookmark