Description
From the zfs-discuss forum:
> RL> why is sum of disks bandwidth from `zpool iostat -v 1`
> RL> less than the pool total while watching `du /zfs`
> RL> on opensol-20060605 bits?
>
> Due to raid-z implementation. See last discussion on raid-z
> performance, etc.
It's an artifact of the way raidz and the vdev read cache interact.
Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk. For some workloads, this is a huge win. For others,
it's a net loss. More tuning is needed, certainly.
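
To make the track-buffer behavior concrete, here is a toy model of a per-disk read cache. It is only a sketch: the class name `ToyVdevCache`, the aligned 64k regions, and the lack of any eviction are assumptions made for illustration, not a description of the actual vdev cache code.

```python
CACHE_FILL = 64 * 1024  # minimum read size described in the text


class ToyVdevCache:
    """Toy per-disk read cache (hypothetical, for illustration only).

    Assumes cache fills are aligned 64k regions and nothing is ever evicted.
    """

    def __init__(self, fill=CACHE_FILL):
        self.fill = fill
        self.cached = set()   # indexes of regions already in the cache
        self.disk_bytes = 0   # total bytes actually read from disk

    def read(self, offset, size):
        """Serve a read; return how many bytes this call fetched from disk."""
        first = offset // self.fill
        last = (offset + size - 1) // self.fill
        fetched = 0
        for region in range(first, last + 1):
            if region not in self.cached:
                # Cache miss: read the whole region, not just what was asked for.
                self.cached.add(region)
                fetched += self.fill
        self.disk_bytes += fetched
        return fetched


# 128 sequential 512-byte reads: only the first one goes to disk.
seq = ToyVdevCache()
for i in range(128):
    seq.read(i * 512, 512)

# One isolated 512-byte read still costs a full 64k fill.
lone = ToyVdevCache()
lone.read(10 * 1024 * 1024 + 512, 512)
```

In the sequential case the cache reads 64k from disk to serve 64k of requests. In the isolated case it reads 64k to serve 512 bytes.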
Both the good and the bad aspects of vdev caching are amplified by
RAID-Z. When you write a 2k block to a 5-disk raidz vdev, it will be
stored as a single 512-byte sector on each disk (4 data + 1 parity).
When you read it back, we'll issue 4 reads (to the data disks);
each of those will become a 64k cache-fill read, so you're reading
a total of 4*64k = 256k to fetch a 2k block. If that block is the
first in a series, you're golden: the next 127 reads will be free
(no disk I/O). On the other hand, if it's an isolated random read,
we just did 128 times more I/O than was actually useful.
This is a rather extreme case, but it's real. I'm hoping that by
making the higher-level prefetch logic in ZFS a little smarter,
we can eliminate the need for vdev-level caching altogether.
If not, we'll need to make the vdev cache policy smarter.
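
The arithmetic in the 2k-block example above can be checked with a few lines. This is a back-of-the-envelope sketch: the function name and parameters are made up for illustration, and it assumes the block is spread across every data disk and that each per-disk piece fits inside a single 64k cache fill.

```python
def raidz_small_read_cost(block_size, ndisks, nparity=1, cache_fill=64 * 1024):
    """Estimate disk I/O for one small raidz read (hypothetical helper).

    Assumes one cache-fill read per data disk, as in the example above.
    Returns (bytes read from disk, amplification factor, later reads that
    would be free if the following blocks are adjacent on disk).
    """
    data_disks = ndisks - nparity              # parity is not read on a normal read
    disk_bytes = data_disks * cache_fill       # one 64k fill per data disk
    amplification = disk_bytes // block_size   # disk bytes per useful byte
    free_reads = amplification - 1             # the rest of what was cached
    return disk_bytes, amplification, free_reads


# 2k block on a 5-disk raidz vdev (4 data + 1 parity)
cost = raidz_small_read_cost(2048, 5)
print(cost)  # (262144, 128, 127)
```

The three numbers match the text: 256k read from disk, 128 times the useful I/O, and 127 subsequent reads served from cache in the sequential case.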
FWIW, my take is that the vdev_cache 64K prefetch causes a largish
performance issue when the I/O becomes saturated for a sustained period.
Mark's work on 6429205 could allow marking a pool as I/O saturated or not,
and this could be used as a hint to control the size of vdev-level prefetching.
vdev prefetching really hurts when doing random I/O (e.g., for databases).
One easy thing we could do to improve performance is to skip vdev
prefetching if the read is on a filesystem with recordsize set. This would
be justifiable because users should only be setting recordsize if they
expect random I/O, where prefetch of any kind is not likely to help.
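
As a rough sketch, the recordsize-based policy proposed above might look like the following. The function name and the `recordsize_explicit` flag are hypothetical; how ZFS would actually pass that information down to the vdev layer is not specified here.

```python
VDEV_CACHE_FILL = 64 * 1024  # current minimum read size, per the description


def vdev_read_size(request_size, recordsize_explicit):
    """Pick how many bytes to read from disk (hypothetical policy sketch).

    recordsize_explicit: True if the filesystem has recordsize set, which
    this proposal treats as a hint that the workload is random I/O.
    """
    if recordsize_explicit:
        # Random I/O expected: read only what was asked for.
        return request_size
    # Otherwise keep today's behavior: inflate small reads to a full cache fill.
    return max(request_size, VDEV_CACHE_FILL)


small_random = vdev_read_size(512, recordsize_explicit=True)
small_default = vdev_read_size(512, recordsize_explicit=False)
```

With this policy a 512-byte read on a filesystem with recordsize set stays at 512 bytes, while the same read elsewhere is still inflated to 64k.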