Looking at a pool with an SSD-based separate intent log (slog) under a workload
that runs N clients, each doing 256 concurrent synchronous ~8K writes over NFS.
All clients use the same FS from a single pool.
Normally, I see peaks of 20K+ writes per second (with 2 slog devices) on the server and ~2K zil_commit_writer calls per second. However, during an spa_sync() we get far fewer zil_commit_writer calls, and the latency of those calls goes up considerably. This in turn causes more zfs_write calls to aggregate per zil_commit_writer, which is normal.
The problem is a disproportionately large amount of time spent in zio_wait(): while we're handling 3X more work per zil_commit_writer, we're waiting roughly 30X longer in zio_wait().
The attached script latoff3.d reports the avg latency of zil_commit_writer and
the avg number of writes per zil_commit_writer. It also reports the avg time spent
in zio_wait. The more writes per commit, the more wait is expected (time to drain memory to slog devices), but it should be more or less proportional. The numbers are separated
into 2 categories: during the sync phase or outside of it.
Note: We needed to set zfs_no_write_throttle to work around 6687412.
The script needs to run for >60 seconds to capture multiple sync phases.
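The per-phase averages described above can be illustrated with a small sketch. This is not the attached latoff3.d. It is a hypothetical Python illustration that assumes each zil_commit_writer call has already been recorded as a sample holding a write count, the zio_wait time, the total commit time, and a flag saying whether spa_sync() was running. The names CommitSample and phase_averages are invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CommitSample:
    """One zil_commit_writer call (hypothetical record layout)."""
    writes: int        # zfs_write calls aggregated into this commit
    zio_wait_ns: int   # time spent in zio_wait during this commit
    commit_ns: int     # total latency of this commit
    syncing: bool      # True if spa_sync() was running at the time


def phase_averages(samples):
    """Return integer averages per phase, keyed 'synching' / 'not-synching'."""
    # Per phase: [commit count, total writes, total zio_wait, total commit time]
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for s in samples:
        t = totals["synching" if s.syncing else "not-synching"]
        t[0] += 1
        t[1] += s.writes
        t[2] += s.zio_wait_ns
        t[3] += s.commit_ns
    return {
        phase: {
            "avg write cnt": t[1] // t[0],
            "avg zio_wait ns": t[2] // t[0],
            "avg commit ns": t[3] // t[0],
        }
        for phase, t in totals.items()
    }
```

Averaging per commit, rather than per write, keeps the three numbers directly comparable: each is a mean over the same set of zil_commit_writer calls within a phase.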
While not synching, we handle 11 writes per zil_commit_writer and spend 256 usec in zio_wait per zil_commit_writer.
avg write cnt not-synching 11
avg zio_wait ns not-synching 256106
avg commit ns not-synching 481365
While synching, we handle 33 writes per zil_commit_writer and spend 8342 usec in zio_wait per zil_commit_writer.
avg write cnt synching 33
avg zio_wait ns synching 8342496
avg commit ns synching 9044736
So here we're doing 33/11 = 3 times more work per zil_commit_writer for 8342496/256106 = ~33 times more zio_wait.
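The arithmetic can be checked with a short sketch that uses only the averages reported above. The helper names ratio and wait_per_write_ns are invented for this illustration. Dividing zio_wait by the write count is just one way to express the "should be proportional" expectation.

```python
# Averages copied from the script output in this report.
NOT_SYNCING = {"writes": 11, "zio_wait_ns": 256106, "commit_ns": 481365}
SYNCING = {"writes": 33, "zio_wait_ns": 8342496, "commit_ns": 9044736}


def ratio(key):
    """How many times larger the synching average is than the not-synching one."""
    return SYNCING[key] / NOT_SYNCING[key]


def wait_per_write_ns(phase):
    """Average zio_wait time per aggregated write within one phase."""
    return phase["zio_wait_ns"] / phase["writes"]


if __name__ == "__main__":
    print(f"work ratio:     {ratio('writes'):.1f}x")
    print(f"zio_wait ratio: {ratio('zio_wait_ns'):.1f}x")
    print(f"commit ratio:   {ratio('commit_ns'):.1f}x")
    print(f"zio_wait per write, not synching: {wait_per_write_ns(NOT_SYNCING):.0f} ns")
    print(f"zio_wait per write, synching:     {wait_per_write_ns(SYNCING):.0f} ns")
```

If zio_wait scaled with the work, the wait per write would stay near 23 usec in both phases. With these numbers it rises to about 253 usec while synching, roughly 11 times higher.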
What I think this says is that during the sync phase, the I/O scheduler is inducing extra latency even for the critical zios that go to the slog devices. This causes a cyclical drop in the number of zfs_write calls that are handled, and their latency is impacted as well.
I reproduced this on an AR (fw_28, ~snv_87) using 6 clients (v2c02...) running the attached oltp2 benchmark.
Mountpoint is /v4
for h in v2c02 v2c04 v2c05 v2c06 v2c07 v2c08; do
    ssh root@$h TIME=10 RRRATIO=100 oltp2 arhost /v4 1000000 100 1
done