July 9th, 2014

Storage I/O latencies – how fast should be fast?

Earlier we discussed what might slow down applications that perform their main functions using CPU and memory. This is what every application does: it processes information in memory, using the CPU to manipulate bits. But the data has to be loaded into main memory from somewhere, and the results need to be sent out. Input/Output is at the heart of any modern computing environment; no computing system can function without it.

Let’s look at one particular aspect – storage I/O – and the challenges virtualization adds to this already complex subject. The computing industry has made tremendous progress in developing very fast storage solutions – high-speed disk drives, storage area networks, efficient file systems – all to deliver information as fast as possible. Many of these solutions are sophisticated and smart: disk drives can be striped together to improve performance and fronted with a cache that holds frequently used data, and modern file systems can optimize block layout to factor in disk rotational latency and predict which blocks will be needed next, so there is very little delay.
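To put a number on that rotational latency, here is a quick back-of-the-envelope calculation. The 15,000 RPM figure is a hypothetical example, not a measurement, but it shows why file systems bother trying to catch a block just as it passes under the head.

```python
# Back-of-the-envelope rotational latency for a hypothetical 15,000 RPM drive.
# On average a random read waits half a revolution for the block to come around.
RPM = 15_000
ms_per_revolution = 60_000 / RPM           # 4.0 ms per full rotation
avg_rotational_latency_ms = ms_per_revolution / 2

print(f"Full revolution:            {ms_per_revolution:.1f} ms")
print(f"Average rotational latency: {avg_rotational_latency_ms:.1f} ms")
# A file system that schedules reads to catch the block as it flies under the
# head can shave most of that ~2 ms off every access.
```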

However, many of these advanced solutions were designed for the physical world, where programs and data were mostly isolated from each other and optimization didn’t have to account for workload interference.

Let’s look at an application running inside a virtual machine and analyze what may slow down its storage I/O. First, it accesses a local file system that, these days, is just a file (e.g. a VMDK) residing on a virtualized datastore. The guest operating system thinks it is using a regular disk: it optimizes disk layout and block reads assuming there are disk heads reading blocks from a rotating platter. It knows the disk’s rotational speed and tries to read a block just as it flies under the read head. Very smart. What it doesn’t know is that this block now lives in a file stored on a much larger disk, managed by a virtualized file system such as VMFS that holds the VMDK files. The real rotation happens there, so when the local file system tries to optimize, it will likely miss the block it needs, causing delays.
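To make that double mapping concrete, here is a toy sketch – with an invented extent table, not real VMDK or VMFS internals – of how blocks the guest believes are sequential can land far apart on the underlying datastore.

```python
# A toy model of the mapping the guest never sees:
#   guest "disk" block -> offset inside the VMDK file -> physical block on the datastore.
# The extent table below is made up purely to illustrate the point.
vmdk_extent_map = {0: 91_000, 1: 12_500, 2: 77_340, 3: 45_002}   # guest block -> physical block

guest_sequential_read = [0, 1, 2, 3]        # what the guest OS believes is a sequential sweep
physical_blocks = [vmdk_extent_map[b] for b in guest_sequential_read]

print("Guest order:   ", guest_sequential_read)
print("Physical order:", physical_blocks)   # scattered across the datastore
```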

VMFS may perform its own optimization and take into account the rotational latencies of the real disks. But it is not workload-demand aware; every block read is treated independently of the others. A single virtual datastore is shared among hundreds of VMs performing concurrent reads and writes. All that sophisticated I/O optimization goes down the drain: the concurrency creates the so-called “I/O blender”, where optimized sequential reads are transformed into random block reads that send the disk heads back and forth across the entire disk surface, eliminating any benefit of file system I/O optimization.
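A crude simulation illustrates the blender effect: each VM’s stream is perfectly sequential on its own, but once the streams are interleaved at the datastore, total head movement explodes. The block layout and VM count below are assumptions chosen purely for illustration.

```python
import random

# A crude model of the "I/O blender": each VM issues a perfectly sequential stream,
# but the shared datastore sees the streams interleaved and effectively random.
NUM_VMS, BLOCKS_PER_VM = 8, 1000
streams = [list(range(vm * 100_000, vm * 100_000 + BLOCKS_PER_VM)) for vm in range(NUM_VMS)]

def total_seek_distance(requests):
    """Sum of head movement between consecutive requests, in block-address units."""
    return sum(abs(b - a) for a, b in zip(requests, requests[1:]))

# One VM alone: the head barely moves between requests.
print("Single stream seek distance:", total_seek_distance(streams[0]))

# Eight VMs mixed together: the very same requests, now arriving interleaved.
blended = [blk for group in zip(*streams) for blk in group]
random.shuffle(blended)     # concurrency makes the arrival order effectively random
print("Blended seek distance:     ", total_seek_distance(blended))
```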

This can cause visible I/O latencies, and the challenge is that there is no single offending party to find and eliminate. The disk’s IOPS capacity is adequate; it is simply the nature of concurrent I/O requests against a shared virtual datastore. One remedy is to place fewer I/O-peaking VMs on the same datastore, which reduces efficiency. Another is to understand each workload’s demand – its peaks and averages – and smartly place VMDK files across multiple datastores so that peaking VMs don’t interfere with each other. But that cannot be done in advance, as workload demand changes every second…
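As a sketch of what “smartly place VMDK files” might mean, here is a minimal greedy placement that spreads the largest peaks across datastores. The VM names and peak-IOPS figures are hypothetical, and a real engine would have to re-evaluate continuously as demand shifts – which is exactly why static, up-front planning falls short.

```python
# A minimal greedy sketch of peak-aware VMDK placement (illustrative numbers only).
vm_peak_iops = {"vm-a": 900, "vm-b": 850, "vm-c": 300, "vm-d": 250, "vm-e": 200, "vm-f": 150}
datastores = {"ds1": 0, "ds2": 0, "ds3": 0}      # running total of placed peak IOPS

placement = {}
for vm, peak in sorted(vm_peak_iops.items(), key=lambda kv: -kv[1]):
    target = min(datastores, key=datastores.get)  # least-loaded datastore so far
    placement[vm] = target
    datastores[target] += peak

print(placement)    # the heavy hitters end up on different datastores
print(datastores)   # roughly balanced peak load per datastore
```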

But this is only one step in delivering data to the application. Once a block is read, it needs to be handled by the storage circuitry inside the host, commonly called the HBA (host bus adapter), which is responsible for moving I/O requests between memory and disks. Because memory and disks run at very different speeds, the operating system implements buffering mechanisms that queue up I/O requests. If multiple VMs running on the same host send requests to the same LUN, the queue of its attached HBA starts growing. While an I/O request sits in the queue, the application that sent it has to wait – another delay. How do we minimize this delay? Spread VMs across multiple hosts to shorten the queue. But which VMs go where? One needs to know how much I/O every VM requests and when it peaks, so that I/O-hungry, peaking VMs are kept on different hosts. Again, this cannot be planned in advance, as workload demand changes every second. And when VMs are placed across different hosts to optimize I/O, let’s not forget a challenge we already discussed – memory and CPU access have to be taken into account as well.
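A rough queueing model shows how quickly the wait grows as more VMs target the same LUN. The service time and per-VM request rates below are illustrative assumptions, and a real HBA is not a textbook M/M/1 queue; the shape of the curve, not the exact numbers, is the point.

```python
# A rough M/M/1 sketch of queueing at a shared LUN: as more VMs push requests,
# utilization climbs and wait time explodes. All figures are illustrative.
service_time_ms = 5.0                       # time the device needs per I/O
service_rate = 1000 / service_time_ms       # ~200 IOPS the LUN can sustain

for vms in (1, 4, 8, 12, 16):
    arrival_rate = vms * 12                 # assume each VM issues ~12 IOPS
    utilization = arrival_rate / service_rate
    if utilization >= 1:
        print(f"{vms:>2} VMs: queue grows without bound")
        continue
    wait_ms = 1000 / (service_rate - arrival_rate)   # M/M/1 mean time in system: W = 1/(mu - lambda)
    print(f"{vms:>2} VMs: utilization {utilization:.0%}, avg I/O latency ~{wait_ms:.1f} ms")
```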

These are just two small pieces of a much larger I/O puzzle IT managers have to solve every day. We didn’t even touch on what SSDs and server-side caching bring to the table, or how a converged fabric, where network and storage traffic are mixed in the same backplane, may slow down I/O. But even looking at only these two pieces, it is fairly obvious that a) it is practically impossible to predict and plan optimization in advance, and b) without being workload-demand aware, no infrastructure optimization will minimize latencies. Even if the number of VMs per datastore is reduced, those VMs could be very I/O intensive and peak together.

Now imagine that you learn of such a problem by receiving an alarm from a virtual desktop reporting that its application response time is more than 200 ms. Where would you go? To a shared datastore with too many VMs? To a shared host with long HBA queues? To an overloaded converged fabric? Do you need a faster disk drive or switch? How fast? Or do you need a better solution?

Related Articles:

http://vmturbo.com/about-virtualization/virtualization/memory-management-if-memory-serves-right/

http://vmturbo.com/about-virtualization/virtualization/virtualization-best-practices-house-cards/