Infinite I/O

A case for distributed content based caching on DRAM

Posted by Alan Brandon on Aug 22, 2013

by Vishal Misra

Application performance improvement is the primary reason to use storage caching, as improved performance leads directly to both a better user experience and higher economic value. Caching helps applications read and write data faster, removing one potential bottleneck in performance.

There are many different caching technologies and approaches: server-side caching, read caching, write-back caching, SSD-based caching, and so on. They can make for a confusing array of options, but in what Simon Sinek calls the golden circle of why/how/what, they all fall squarely in the “what” portion. It is important to go back to the “why,” namely speeding up application performance.

Application performance, I/O latencies and storage load

Application performance is hurt by increased I/O latencies: the longer the storage system takes to respond, the worse the application performs. A centralized storage system like a NAS or SAN is extremely complex, but a queueing model abstraction works surprisingly well for most of them, and the resulting load/latency relationship is highly nonlinear. The following figure illustrates the impact of load on latency. For our discussion, point A is where latency is painful, point B provides good performance, and point C is where the storage system is blazingly fast. As we can see, the reduction in load required to go from point A to point B on the latency curve is ab, whereas the reduction required to go from point B to point C is bc. Note, however, that the corresponding reductions in latency are highly disproportionate: AB is much larger than BC, even though ab is much smaller than bc. The takeaway from this figure is that where you reduce load on the load/latency curve matters more than how much load you reduce.

Figure: The relationship between load and latency

So now we go from the “why” (speeding up application performance) to the “how.” From the little exercise above, we can reduce latencies by reducing the load on the centralized storage system. All caching technologies attempt to do exactly that: read caching, write-back caching, write coalescing, and so on. We can go from point A to B to C using a variety of techniques. We reduce the load by a small amount (ab) to go from A to B, and if the latency at B is acceptable for our application, then the “why” has been accomplished. However, if every microsecond matters and storage is the primary bottleneck the application faces, then you may want to go all the way from A to C, which requires a load reduction of ab+bc. Note that this model does not distinguish between reads and writes: if you cache enough reads, you slide down the load/latency curve and write performance improves implicitly.
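
To make the nonlinearity concrete, here is a small back-of-the-envelope sketch in Python. It treats the storage system as a simple M/M/1 queue, which is only a rough stand-in for a real array; the service rate and the load levels chosen for points A, B and C are made-up numbers for illustration.

SERVICE_RATE = 10000.0  # IOPS the array can sustain (a made-up number)

def mean_latency_ms(utilization):
    """Mean response time in ms of an M/M/1 queue at the given utilization."""
    arrival_rate = utilization * SERVICE_RATE       # requests per second
    return 1000.0 / (SERVICE_RATE - arrival_rate)   # 1/(mu - lambda), in ms

for label, util in [("A (overloaded)", 0.98), ("B (healthy)", 0.80), ("C (light)", 0.50)]:
    print(f"point {label:15s} load={util:.0%}  latency={mean_latency_ms(util):5.2f} ms")

# Going from A to B cuts load by 18 points and latency drops from 5 ms to 0.5 ms;
# going from B to C cuts load by a further 30 points for only 0.3 ms more.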

After the why and the how, the what

Now that we know the “how,” we need to figure out a “what.” We start with the assumption that the application is suffering and we are at point A on the load/latency curve. If the workload is read heavy, then it is conceivable we can go all the way from A to B to C simply by deploying read caching, freeing up the storage server to handle write traffic. Additionally, if point B is good enough for our application, then a relatively small read cache can achieve the required load reduction ab. If, however, we have a write-heavy workload, then we may need a combination of read and write caching to achieve the desired reduction in latencies. The size of the cache is directly correlated with cache hit rates, and thus with the reduction in load, but again often in a very nonlinear fashion. For example, measurements on a Facebook Hadoop workload reported in a paper from the University of California, Berkeley, showed very heavy-tailed access patterns: most blocks were read only once and never again, whereas a small set of hot blocks was read over and over. For blocks that are never reread, caching is of no use, though pre-fetching might help. For such workloads a smaller cache is just as effective as a large one, and this is not atypical. In a nutshell, there is a law of diminishing returns with cache size, and one needs to right-size the cache to achieve the optimal price/performance tradeoff.
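
To see the diminishing returns in action, here is a small simulation sketch in Python. It replays a heavy-tailed access stream through an LRU cache of increasing size and prints the hit rate; the Zipf-like exponent, block count, and trace length are arbitrary assumptions for illustration and are not taken from the Berkeley paper's trace.

import random
from collections import OrderedDict

random.seed(42)

# Synthetic heavy-tailed workload: a handful of hot blocks get most of the
# accesses while the long tail is touched rarely. The parameters below are
# made-up numbers for illustration only.
NUM_BLOCKS = 50_000
NUM_ACCESSES = 200_000
weights = [1.0 / (rank ** 1.1) for rank in range(1, NUM_BLOCKS + 1)]
trace = random.choices(range(NUM_BLOCKS), weights=weights, k=NUM_ACCESSES)

def lru_hit_rate(trace, cache_size):
    """Replay the access trace through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / len(trace)

for size in (100, 1_000, 10_000):
    print(f"cache of {size:6d} blocks -> hit rate {lru_hit_rate(trace, size):.1%}")

# The hit rate climbs quickly for small caches and then flattens out: most of
# the benefit comes from the hot blocks, so each extra gigabyte buys less.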

Additionally, in any typical datacenter, the workload seen by different servers varies over time. A distributed, shared cache across these servers exploits this fluctuation in a smart way: when one server is relatively idle, the cache local to that server can be used by other servers with heavier demands at that moment. Thus, the effective cache available to a server adjusts dynamically with the workload. Modern in-rack network latencies and bandwidth are such that the overhead of fetching data from remote DRAM elsewhere in the rack is about the same as fetching it from a local disk or SSD. DRAM access is truly random and does not suffer from the unpredictability that sometimes accompanies SSDs, especially for writes (write amplification, wear leveling, garbage collection, and so on). Additionally, DRAM is a commodity now, and DIMMs are an inexpensive resource to put in servers.
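
As a rough illustration of how a rack's worth of DRAM can be pooled, here is a generic sketch in Python of a hash-partitioned shared cache. The class names, the ownership scheme, and the eviction policy are illustrative assumptions, not a description of any particular implementation.

import hashlib

# Generic sketch of a rack-wide shared cache: every server contributes a slice
# of its local DRAM, and each block is owned by whichever node its key hashes
# to. This is an illustrative design only.

class CacheNode:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # number of blocks this server contributes
        self.store = {}            # key -> block data, held in local DRAM

    def get(self, key):
        return self.store.get(key)

    def put(self, key, data):
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))   # crude FIFO-style eviction
        self.store[key] = data

class SharedRackCache:
    """Hash-partitioned cache spread over the DRAM of all servers in a rack."""

    def __init__(self, nodes):
        self.nodes = nodes

    def _owner(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def get(self, key, fetch_from_storage):
        node = self._owner(key)
        data = node.get(key)                  # remote DRAM hop ~ local SSD read
        if data is None:
            data = fetch_from_storage(key)    # miss: go to the shared array
            node.put(key, data)
        return data

# Because ownership depends on the key rather than on which server issued the
# read, an idle node's DRAM naturally ends up holding blocks for busier peers.
nodes = [CacheNode(f"esx{i}", capacity=1000) for i in range(3)]
cache = SharedRackCache(nodes)
print(cache.get("vm1:vmdk:block42", fetch_from_storage=lambda k: f"<data for {k}>"))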

Virtualized datacenters bring a very interesting twist to this scenario. There is a high degree of redundancy in the blocks that make up the virtual disks, and hence a high degree of commonality in the working set, both between VMs on the same server and between VMs on different servers. If our cache is location based (blocks are referred to by their location in the storage system), then this commonality is lost. If, however, the cache is based on the content of the blocks, then we get a massive, deduplicated, distributed cache spanning the servers. The deduplication and temporal sharing of cache space by VMs within and across servers results in a massive increase in the effective cache size seen by each server, and large reductions in storage load become possible with a modest amount of physical resources. Thus, under this architecture, a small amount of memory can often deliver the same acceptable performance as SSD-based caching.
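
Here is a minimal sketch in Python of the content-addressed idea: blocks are keyed by a hash of their contents rather than their location, so identical blocks read by different VMs occupy a single cache entry. The SHA-256 fingerprint and the 4 KB block size are illustrative assumptions.

import hashlib

# Content-addressed cache sketch: the key is a fingerprint of the block's
# contents, not its location on the array, so identical blocks read through
# different VMs (or different offsets) are stored in DRAM only once.

class ContentCache:
    def __init__(self):
        self.blocks = {}    # content fingerprint -> block data (deduplicated)
        self.index = {}     # (vm, logical block address) -> fingerprint

    @staticmethod
    def fingerprint(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def insert(self, vm, lba, data: bytes):
        fp = self.fingerprint(data)
        self.index[(vm, lba)] = fp
        self.blocks.setdefault(fp, data)    # duplicate content is not stored twice

    def lookup(self, vm, lba):
        fp = self.index.get((vm, lba))
        return self.blocks.get(fp) if fp else None

cache = ContentCache()
os_block = b"\x00" * 4096                   # an identical guest-OS block in two VMs
cache.insert("vm1", lba=100, data=os_block)
cache.insert("vm2", lba=7340, data=os_block)
print(len(cache.index), "logical blocks cached,", len(cache.blocks), "copy in DRAM")
# prints: 2 logical blocks cached, 1 copy in DRAM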

Then there is an almost free win that comes with a shared, distributed cache. In the modern datacenter, VMs are moving around constantly with DRS and vMotion. With a shared, distributed cache, there is no need to explicitly move the cache along with the VM; it happens implicitly and automatically!

Hence, Infinio Accelerator uses this core architecture to bring the performance that your applications need. If your storage array is hovering around point A, and your workloads are sending a significant enough number of reads to your NAS, then our storage accelerator can bring immediate relief. Even if you are at B or C, installing our accelerator can prevent you from sliding up to A, helping to delay that expensive hardware upgrade.
