OpenZFS: Artificial Disk Latency Using zinject

About a year ago I had the opportunity to work on a small extension to the OpenZFS zinject command with colleagues Matt Ahrens and Frank Salzmann, during one of our Delphix engineering-wide hackathon events. Now that it’s in the process of landing in the upstream OpenZFS repository, I wanted to take a minute to show it off.

To describe the new functionality, I’ll defer to the help message:

$ zinject -h
... <snip> ...

zinject -d device -D latency:lanes pool

        Add an artificial delay to IO requests on a particular
        device, such that the requests take a minimum of 'latency'
        milliseconds to complete. Each delay has an associated
        number of 'lanes' which defines the number of concurrent
        IO requests that can be processed.

        For example, with a single lane delay of 10 ms (-D 10:1),
        the device will only be able to service a single IO request
        at a time with each request taking 10 ms to complete. So,
        if only a single request is submitted every 10 ms, the
        average latency will be 10 ms; but if more than one request
        is submitted every 10 ms, the average latency will be more
        than 10 ms.

        Similarly, if a delay of 10 ms is specified to have two
        lanes (-D 10:2), then the device will be able to service
        two requests at a time, each with a minimum latency of
        10 ms. So, if two requests are submitted every 10 ms, then
        the average latency will be 10 ms; but if more than two
        requests are submitted every 10 ms, the average latency
        will be more than 10 ms.

        Also note, these delays are additive. So two invocations
        of '-D 10:1', is roughly equivalent to a single invocation
        of '-D 10:2'. This also means, one can specify multiple
        lanes with differing target latencies. For example, an
        invocation of '-D 10:1' followed by '-D 25:2' will
        create 3 lanes on the device; one lane with a latency
        of 10 ms and two lanes with a 25 ms latency.

... <snip> ...
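
To make the additive-lane behavior concrete, here's a hypothetical pair of invocations mirroring the help text's example (output elided; the device and pool names are the same ones used later in this article). After both commands, the device would have 3 lanes: one with a 10ms latency and two with a 25ms latency:

# zinject -d c1t1d0 -D 10:1 tank
# zinject -d c1t1d0 -D 25:2 tank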

Given that explanation, let's dive in and give this a test drive with a few examples.

Setup and Baseline

First, let's start with a basic example using a pool with a single disk:

# zpool create tank c1t1d0
# zfs create tank/foo

Now let’s do some IO to the pool and use dtrace to examine the latency of the requests to get a baseline expectation for the IO latency of writes to this pool.

In one terminal, I have a simple, single-threaded write workload running in an infinite loop. It continually copies some random data to a file in the pool, removes that file, and repeats.

# dd if=/dev/urandom of=/tmp/urandom-cache bs=1M count=1024 &>/dev/null
# while true; do
> cp /tmp/urandom-cache /tank/foo/urandom-cache
> sync
> rm /tank/foo/urandom-cache
> sync
> done

Now, let's look at the latency of the write requests issued to disk by the filesystem, using dtrace (see the DTrace Scripts section below for the source of this script):

# dtrace -qs zio-rw-latency.d -c 'sleep 60' tank write

histogram of latencies (us):

           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         248
             512 |@@@@@                                    6204
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   26218
            2048 |@@@@@@@@@@@                              13070
            4096 |@@                                       2474
            8192 |                                         416
           16384 |                                         56
           32768 |                                         7
           65536 |                                         0

avg latency (us)   stddev (us)   iops   throughput (KB/s)
2025               1609          811    99623

This shows a histogram of the latency of all writes that were submitted and completed to our pool during that interval. Additionally, it shows the average and standard deviation of the latency over all requests, the number of requests issued and completed per second, and the throughput achieved by these writes.

This sets the baseline performance that we expect to get out of this storage configuration; most requests take less than 8 ms to be serviced, with the average being about 2ms and a standard deviation of roughly 1.6ms.

First Experiment: Single Disk, Single Lane

Now that we have some baseline numbers to compare against, let's use the new zinject functionality to slow down the requests. Let's say we want to delay all requests such that they take a minimum of 10ms to complete.

Remembering that we created this pool in the previous section using a single disk named c1t1d0, we can set the minimum latency of this disk to be our desired 10ms:

# zinject -d c1t1d0 -D 10:1 tank
Added handler 1 with the following properties:
  pool: tank
  vdev: c138fa4fbf3d8f2

And to double-check that this is the only delay registered on our pool, we can run zinject without any parameters to dump a list of all registered delays:

# zinject
 ID  POOL             DELAY (ms)       LANES            GUID
---  ---------------  ---------------  ---------------  ----------------
  1  tank             10               1                c138fa4fbf3d8f2

The “lanes” column describes the number of requests that can be serviced simultaneously by each registered delay; thus, this delay specifies that only a single request can be serviced at a time, and each request will take a minimum of 10ms to complete.

Now, with that delay in place, and the same write workload from the previous section still running, we can again use dtrace to inspect the write request latencies, IOPs, and throughput values. Given this delay, we should expect to see no request complete sooner than 10ms, and as a result, the IOPs and throughput should take a significant hit:

# dtrace -qs zio-rw-latency.d -c 'sleep 60' tank write

histogram of latencies (us):

           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             4177
           16384 |@@@@@@                                   894
           32768 |@@@@@                                    807
           65536 |                                         1
          131072 |                                         0

avg latency (us)   stddev (us)   iops   throughput (KB/s)
17080              12448         97     12518

Great! This matches our expectation.

Since we’re using a logarithmic histogram, we can’t actually differentiate between requests in the 8-16ms range, but judging by the fact that there aren’t any requests in the 0-8ms range, it’s safe to assume all the requests that landed in the 8-16ms bucket actually took 10-16ms to complete (as they should have).

Wait, What About the Average Latency?

At this point, you might be looking at the average latency reported in the previous experiment and wondering why it's so much larger than our target latency of 10ms. Without the delay, most requests completed in under 8ms, so why isn't the average latency with the delay closer to the desired 10ms?

Remember that when we created the delay we specified it to have a single “lane”. What this means is only a single request can be serviced at a time. If more than a single write is submitted every 10ms to the underlying disk, then each new write request will get backed up by any existing requests.

In other words, each lane represents a FIFO queue. When a new IO request is submitted, the FIFO queue with the shortest wait time is selected (when there are multiple lanes to choose from), and the request is placed in that queue, waiting its turn until it can be processed and eventually completed.

Thus, in our example above, if a request is submitted while another request is currently being processed and still has 7ms to finish, the new request's latency will be a minimum of 17ms: 7ms spent waiting for the prior request to be processed, and 10ms spent being processed itself. Due to the way OpenZFS issues writes, multiple requests will certainly be submitted simultaneously, so it's common for this situation to occur, and this explains the average latency in the previous experiment.
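
To make that arithmetic concrete, here's a minimal shell sketch of the single-lane timeline just described (all times in milliseconds; the 3ms arrival time is a hypothetical value, chosen so the prior request still has 7ms left):

a_submit=0                      # request A submitted; the lane is idle
a_done=$(( a_submit + 10 ))     # A holds the lane until t=10ms
b_submit=3                      # B arrives while A still has 7ms left
b_done=$(( a_done + 10 ))       # B waits for A, then takes 10ms itself
echo "B latency: $(( b_done - b_submit ))ms"    # prints: B latency: 17ms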

Second Experiment: Single Disk, Multiple Lanes

If the previously described behavior isn’t actually what is desired, we can simulate an implementation where requests pay no attention to prior requests by adding a delay with enough lanes such that there will always be an open lane to accept new requests. As an example, let’s create a delay with the same 10ms minimum latency as before, but instead of 1 lane, we’ll create 16 lanes:

# zinject -c all
removed all registered handlers

# zinject -d c1t1d0 -D 10:16 tank
Added handler 2 with the following properties:
  pool: tank
  vdev: c138fa4fbf3d8f2

# zinject
 ID  POOL             DELAY (ms)       LANES            GUID
---  ---------------  ---------------  ---------------  ----------------
  2  tank             10               16               c138fa4fbf3d8f2

Assuming the magic number of 16 is sufficient to prevent new requests from piling up behind older, in-progress requests, we should now see the average latency trend much closer to the specified 10ms:

# dtrace -qs zio-rw-latency.d -c 'sleep 60' tank write

histogram of latencies (us):

           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8836
           16384 |                                         39
           32768 |                                         11
           65536 |                                         1
          131072 |                                         1
          262144 |                                         0

avg latency (us)   stddev (us)   iops   throughput (KB/s)
10261              3022          148    18463

Perfect! As expected, the average latency is now much closer to our desired 10ms. Unfortunately though, there were 52 requests that actually took longer than 16ms to be completed by the underlying storage (2 of which took longer than 64ms!), which pulls the average slightly above the specified target of 10ms.
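
As a rough sanity check on that magic number of 16: if each lane completes at most one request per 10ms, then 16 lanes impose an IOPS ceiling far above what this single-threaded workload generates, so new requests should rarely find all lanes busy (back-of-the-envelope shell arithmetic, not output from any of the tools above):

# echo $(( 1000 / 10 * 16 ))
1600

With the measured 148 IOPs nowhere near that ceiling, it's no surprise the average latency stayed close to the 10ms target.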

Third Experiment: Setup and Baseline

Another interesting example to demonstrate is simulating a pool with disks of different latency characteristics. This can happen for a variety of reasons, none of which I’ll touch on in this article. To simulate this case, we’ll need a pool with two disks:

# zpool create tank c1t1d0 c1t2d0
# zfs create tank/foo

We’ll use a slightly modified version of the write workload from the previous sections; instead of a single thread, we’ll use 8 threads, increasing the amount of data being written and the number of simultaneous write requests:

# while true; do
> PIDS=""
> for i in $(seq 1 8); do
>     cp /tmp/urandom-cache /tank/foo/urandom-cache-$i &>/dev/null &
>     PIDS="$PIDS $!"
> done
> sync
> for pid in $PIDS; do
>     wait $pid
> done
>
> PIDS=""
> for i in $(seq 1 8); do
>     rm /tank/foo/urandom-cache-$i &>/dev/null &
>     PIDS="$PIDS $!"
> done
> sync
> for pid in $PIDS; do
>     wait $pid
> done
> done

And with a slight variation of the previous dtrace script, we can show the same statistics as the prior sections, but broken down by each vdev in the pool (see the DTrace Scripts section below for the source of this script):

# dtrace -qs vdev-rw-latency.d -c 'sleep 60' tank write

histogram of latencies (us):

f1d32d9b9514d7a3
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         43
             512 |                                         665
            1024 |@@                                       3231
            2048 |@                                        1848
            4096 |@@@@@@@@@@@@@                            17886
            8192 |@@@@@@@@@@@@@@@@@@@@                     26511
           16384 |@@@                                      3493
           32768 |                                         310
           65536 |                                         4
          131072 |                                         2
          262144 |                                         0

f4eb4119d7170682
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         41
             512 |                                         627
            1024 |@@                                       3207
            2048 |@                                        1865
            4096 |@@@@@@@@@@@@@                            17809
            8192 |@@@@@@@@@@@@@@@@@@@@                     26416
           16384 |@@@                                      3570
           32768 |                                         312
           65536 |                                         9
          131072 |                                         1
          262144 |                                         0

GUID               avg latency (us)   stddev (us)   iops   throughput (KB/s)
f1d32d9b9514d7a3   9475               5292          899    111305
f4eb4119d7170682   9496               5298          897    111116

As expected, without any delays injected with zinject, each vdev performs nearly identically.

Third Experiment: Multiple Disks, Different Latencies

Now that we’ve verified that the vdevs perform similarly in the absence of any artificial delays, we can simulate a case where one disk is twice as slow as the other by specifying two different zinject delays for the two disks:

# zinject -d c1t1d0 -D 16:16 tank
Added handler 3 with the following properties:
  pool: tank
  vdev: f4eb4119d7170682

# zinject -d c1t2d0 -D 32:16 tank
Added handler 4 with the following properties:
  pool: tank
  vdev: f1d32d9b9514d7a3

# zinject
 ID  POOL             DELAY (ms)       LANES            GUID
---  ---------------  ---------------  ---------------  ----------------
  3  tank             16               16               f4eb4119d7170682
  4  tank             32               16               f1d32d9b9514d7a3

In this scenario we’d expect the total throughput to take a significant hit as a result of the large delays, but we’d also expect the faster disk to service twice the number of write requests in any given interval.
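
That 2x expectation is simple arithmetic: each lane completes at most one request per latency period, so with 16 lanes apiece, the two disks have roughly a 2:1 ratio of IOPS ceilings (back-of-the-envelope shell arithmetic, assuming fully saturated lanes):

# echo $(( 1000 / 16 * 16 )) $(( 1000 / 32 * 16 ))
992 496

Thus, we should see the faster disk demonstrate roughly twice the throughput when inspected with dtrace: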

# dtrace -qs vdev-rw-latency.d -c 'sleep 60' tank write

histogram of latencies (us):

f4eb4119d7170682
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29964
           16384 |@                                        1012
           32768 |                                         117
           65536 |                                         4
          131072 |                                         0

f1d32d9b9514d7a3
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         26
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16104
           32768 |                                         162
           65536 |                                         5
          131072 |                                         0

GUID               avg latency (us)   stddev (us)   iops   throughput (KB/s)
f4eb4119d7170682   16391              2128          518    64651
f1d32d9b9514d7a3   32310              1680          271    33630

And that's exactly what was measured! Disk f4eb4119d7170682 was given the 16ms delay, and it showed nearly double the IOPs and nearly double the throughput during the measured interval. Success!

DTrace Scripts

In the spirit of full disclosure, here’s the source for the dtrace scripts that I used in the previous sections.

ZIO Latency

Here’s the script used in the first two experiments, which does not break down the statistics by vdev:

# cat zio-rw-latency.d
BEGIN
{
        start = timestamp;
}

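/*
 * Record the time at which each zio of the requested type
 * ("read" or "write") is issued to disk for the requested pool.
 */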
fbt:zfs:vdev_disk_io_start:entry
/this->zio = args[0],
 this->zio->io_type == ($$2 == "read" ? 1 : 2) &&
 stringof(this->zio->io_spa->spa_name) == $$1/
{
        ts[this->zio] = timestamp;
}

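/*
 * When a zio we timed above completes, compute its latency in
 * microseconds and update the aggregations.
 */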
fbt:zfs:zio_interrupt:entry
/this->zio = args[0], ts[this->zio]/
{
        delta = (timestamp - ts[this->zio])/1000;

        @i = count();
        @a = avg(delta);
        @s = stddev(delta);
        @l = quantize(delta);
        @t = sum(this->zio->io_size);

        ts[this->zio] = 0;
}

END
{
        printf("\nhistogram of latencies (us):");
        printa(@l);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@t, (timestamp - start) / 1000000000 * 1024);

        printf("%-16s   %-11s   %-4s   %-17s\n",
            "avg latency (us)", "stddev (us)", "iops", "throughput (KB/s)");
        printa("%@-16u   %@-11u   %@-4u   %@-17u\n", @a, @s, @i, @t);
}

This script takes two parameters: $$1 is the pool name (i.e. tank in our examples), and $$2 is either "read" or "write"; only "write" was used in this article's examples.
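
For example, to measure the latency of reads rather than writes on the same pool, it could be invoked like so:

# dtrace -qs zio-rw-latency.d -c 'sleep 60' tank read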

VDEV Latency

Here’s the script used in the last experiment, which differentiates the statistics based on the individual vdevs in the pool:

# cat vdev-rw-latency.d
BEGIN
{
        start = timestamp;
}

fbt:zfs:vdev_disk_io_start:entry
/this->zio = args[0],
 this->zio->io_type == ($$2 == "read" ? 1 : 2) &&
 stringof(this->zio->io_spa->spa_name) == $$1/
{
        ts[this->zio] = timestamp;
}

fbt:zfs:zio_interrupt:entry
/this->zio = args[0], ts[this->zio]/
{
        delta = (timestamp - ts[this->zio])/1000;

        @i[this->zio->io_vd->vdev_guid] = count();
        @a[this->zio->io_vd->vdev_guid] = avg(delta);
        @s[this->zio->io_vd->vdev_guid] = stddev(delta);
        @l[this->zio->io_vd->vdev_guid] = quantize(delta);
        @t[this->zio->io_vd->vdev_guid] = sum(this->zio->io_size);

        ts[this->zio] = 0;
}

END
{
        printf("\nhistogram of latencies (us):\n\n");
        printa("%x %@u\n", @l);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@t, (timestamp - start) / 1000000000 * 1024);

        printf("%-16s   %-16s   %-11s   %-4s   %-17s\n", "GUID",
            "avg latency (us)", "stddev (us)", "iops", "throughput (KB/s)");
        printa("%-16x   %@-16u   %@-11u   %@-4u   %@-17u\n", @a, @s, @i, @t);
}

This script takes the same $$1 and $$2 parameters as the zio-rw-latency.d script above.
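
So, to break down read latencies per vdev, one could run:

# dtrace -qs vdev-rw-latency.d -c 'sleep 60' tank read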

Why Not Use DTrace’s “io” Provider?

Inevitably there will be at least one curious reader who looks at the above dtrace scripts and wonders why I didn't use the stable io provider (i.e. io:::start and io:::done). This is intentional: the delays injected using zinject occur above the level of the io provider.

Essentially, the underlying storage completes each request as fast as it can, and the io provider reports on that latency. It's only after the storage has completed the request and returned the result back up to OpenZFS that the request may be delayed further by this new zinject feature. By that point, the io provider has already recorded the latency, so any additional latency introduced by zinject goes unnoticed.

I can easily demonstrate this with the following example:

# zpool create tank c1t1d0
# zfs create tank/foo

# while true; do
> cp /tmp/urandom-cache /tank/foo/urandom-cache
> sync
> rm /tank/foo/urandom-cache
> sync
> done &
[1] 9388

# zinject -d c1t1d0 -D 10:1 tank
Added handler 5 with the following properties:
  pool: tank
  vdev: e2503467ad6a2e62

# zinject
 ID  POOL             DELAY (ms)       LANES            GUID
---  ---------------  ---------------  ---------------  ----------------
  5  tank             10               1                e2503467ad6a2e62

So, we have a pool with a single disk and a delay of 10ms with a single lane. Let's look at the write request latency on the system as reported by the io provider:

# dtrace -qs io-rw-latency.d -c 'sleep 60' write


           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         26
             512 |@@                                       266
            1024 |@@@@@@@@@@@@@@@@@@@@                     2208
            2048 |@@@@@@@@@@                               1079
            4096 |@@@@                                     461
            8192 |@@                                       206
           16384 |@                                        54
           32768 |                                         10
           65536 |                                         1
          131072 |                                         0

It's clear that the latency reported here does not match the 10ms minimum latency requested by the zinject command. To verify that the delay is actually working correctly, let's use the zio-rw-latency.d script from the prior sections:

# dtrace -qs zio-rw-latency.d -c 'sleep 60' tank write

histogram of latencies (us):

           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@                       2606
           16384 |@@@@@@                                   898
           32768 |@@@@@@@@@                                1308
           65536 |@@@@@@@                                  1095
          131072 |                                         0

avg latency (us)   stddev (us)   iops   throughput (KB/s)
33768              26629         98     12573

Clearly, the delay is in place; it's just that the io provider doesn't detect the additional latency introduced by zinject.

And finally, here’s the source for the dtrace script used above, which uses the io provider:

# cat io-rw-latency.d
io:::start
/args[0]->b_flags & ($$1 == "read" ? B_READ : B_WRITE)/
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        @ = quantize((timestamp -
            ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000);
        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}
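
Unlike the previous two scripts, this one takes a single parameter: $$1 is either "read" or "write". Note that it doesn't filter by pool, so it reports on write requests to all disks in the system.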