OpenZFS: Isolating ZIL Disk Activity
I recently completed a project to improve the performance of the OpenZFS
ZIL (see here for more
details); i.e. improving the performance of synchronous activity on
OpenZFS, such as writes using the
O_SYNC flag. As part of that work,
I had to run some performance testing and benchmarking of my code
changes (and the system as a whole), to ensure the system was behaving
as I expected.
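As a rough sketch of this kind of workload, a tool like fio can generate synchronous writes via its sync=1 option, which opens the target file with O_SYNC (the file path below is hypothetical, and fio isn't necessarily the tool I used; it's just an easy way to produce this kind of IO):
$ fio --name=sync-writes --filename=/tank/fio-test --rw=write --bs=8k --size=1g --sync=1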
Early on in my benchmarking exercises, I became confused by the data
that I was gathering. I was expecting only a certain number of writes to
be active on the underlying storage devices at any given time, based on
the known workload that I was applying to the zpool, and based on my
understanding of the ZIL’s mechanics. When running these known workloads
and inspecting the actv column from iostat, though, I was
consistently seeing more write activity on the devices than I expected.
At this point, I was starting to question my understanding of the code that I had written, and my understanding of the ZIL’s mechanics as a whole. Since I knew exactly the IO workload that was being applied to the system, why wasn’t it behaving as I had predicted?
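For reference, the actv column is part of iostat's extended device statistics on illumos-based systems, and reports the average number of operations actively being serviced by the device; an invocation along these lines shows it updating every second:
$ iostat -xn 1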
After scratching my head and consulting the code numerous times, I asked
Matt Ahrens if he had any clues as to what might be going on. Matt was
quick to remind me that I was failing to incorporate the IO that would
occur as part of
spa_sync() into my mental model. Additionally, he
suggested that since it would be difficult to know exactly how many IOs
to expect from spa_sync(), and thus difficult to verify the
actv column from iostat with respect to my code changes, I should configure
the system to effectively disable
spa_sync() altogether. This way, all
of the IOs that would be active on the disk would be a result of a ZIL
write, which is exactly what I was previously expecting.
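One quick way to confirm whether spa_sync() is actually running during a test (a sketch, assuming DTrace's fbt provider is available on the system) is to trace its entry point; each line of output corresponds to a TXG sync starting:
$ sudo dtrace -n 'fbt::spa_sync:entry { printf("%Y", walltimestamp); }'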
To achieve this, Matt pointed me at the following kernel tunables:
zfs_dirty_data_max, zfs_dirty_data_sync, and zfs_txg_timeout.
Basically, I had to set all of these values such that
the dirty limit would never be reached, and thus a TXG sync would never
trigger as a result of the amount of dirty data my workload generated.
Since my test system had 128 GB of RAM, I used the following
commands/values to achieve this configuration (the 0t prefix tells mdb
the values are decimal; they correspond to 64 GiB, 32 GiB, and one hour,
respectively):
$ sudo mdb -kwe 'zfs_dirty_data_max/z 0t68719476736'
$ sudo mdb -kwe 'zfs_dirty_data_sync/z 0t34359738368'
$ sudo mdb -kwe 'zfs_txg_timeout/z 0t3600'
It’s also important to note that these values depend on the rate at which my workload would dirty data (i.e. the workload’s write throughput), and on the duration of the test. I ensured that the workload would not be able to dirty enough data to cause a TXG sync prior to the test completing. With all of this configured correctly, the only way writes would get issued to disk would be via the ZIL, which is exactly what I wanted.
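To double check that the writes took effect, the values can also be read back with mdb (a sketch; /E prints an 8-byte decimal value for the two uint64_t tunables, and /D a 4-byte decimal for the int zfs_txg_timeout):
$ sudo mdb -ke 'zfs_dirty_data_max/E'
$ sudo mdb -ke 'zfs_dirty_data_sync/E'
$ sudo mdb -ke 'zfs_txg_timeout/D'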
Additionally, I tuned the system to disable the IO aggregation that ZFS may perform when there’s sufficient write activity to warrant it:
$ sudo mdb -kwe 'zfs_vdev_aggregation_limit/z 0t0'
While this setting didn’t help my workload’s throughput, it did help me validate the correctness of my code changes.
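When done testing, the aggregation limit can be restored the same way; the value below assumes the illumos default of the time, 128 KB (0t131072):
$ sudo mdb -kwe 'zfs_vdev_aggregation_limit/z 0t131072'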