OpenSolaris

Bug ID 6437054
Synopsis vdev_cache wises up: increase DB performance by 16%
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:zfs
Keywords filebench
Responsible Engineer Eric Kustarz
Reported Against s10_43
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_70
Fixed In snv_70
Release Fixed solaris_nevada(snv_70) , solaris_10u6(s10u6_01) (Bug ID:2156248)
Related Bugs 6604198 , 6605998 , 4933977
Submit Date 10-June-2006
Last Update Date 29-April-2008
Description
From the zfs-discuss forum:

> RL> why is sum of disks bandwidth from `zpool iostat -v 1`
> RL> less than the pool total while watching `du /zfs`
> RL> on opensol-20060605 bits?
> 
> Due to raid-z implementation. See last discussion on raid-z
> performance, etc.

It's an artifact of the way raidz and the vdev read cache interact.

Currently, when you read a block from disk, we always read at least 64k.
We keep the result in a per-disk cache -- like a software track buffer.
The idea is that if you do several small reads in a row, only the first
one goes to disk.  For some workloads, this is a huge win.  For others,
it's a net loss.  More tuning is needed, certainly.
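
To make the trade-off concrete, here is a toy model of a per-disk read
cache.  It is only a sketch, not the actual vdev_cache code: the class and
function names are made up for illustration, and it assumes cache fills are
aligned to 64k boundaries and that nothing is ever evicted.

```python
CACHE_FILL = 64 * 1024  # bytes fetched from disk on every miss (64k, per the text)


class ToyVdevCache:
    """Hypothetical per-disk read cache: each miss fetches one aligned 64k region."""

    def __init__(self, fill=CACHE_FILL):
        self.fill = fill
        self.cached = set()   # indices of cached regions (assumption: no eviction)
        self.disk_reads = 0   # number of reads actually sent to disk
        self.disk_bytes = 0   # bytes actually transferred from disk

    def read(self, offset, size):
        """Serve one read; return True if it needed no disk I/O."""
        first = offset // self.fill
        last = (offset + size - 1) // self.fill
        hit = True
        for region in range(first, last + 1):
            if region not in self.cached:
                self.cached.add(region)
                self.disk_reads += 1
                self.disk_bytes += self.fill
                hit = False
        return hit


def run(offsets, size=512):
    """Issue one small read per offset; return (disk_reads, disk_bytes)."""
    cache = ToyVdevCache()
    for off in offsets:
        cache.read(off, size)
    return cache.disk_reads, cache.disk_bytes


# 128 back-to-back 512-byte reads: only the first one goes to disk.
sequential = run(range(0, 128 * 512, 512))

# 128 reads spaced 1 MB apart: every one triggers a full 64k fill.
scattered = run(range(0, 128 * 1024 * 1024, 1024 * 1024))
```

Both runs ask for the same 64k of useful data.  In this model the sequential
run costs a single 64k disk read, while the scattered run costs 128 of them.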

Both the good and the bad aspects of vdev caching are amplified by
RAID-Z.  When you write a 2k block to a 5-disk raidz vdev, it will be
stored as a single 512-byte sector on each disk (4 data + 1 parity).
When you read it back, we'll issue 4 reads (to the data disks);
each of those will become a 64k cache-fill read, so you're reading
a total of 4*64k = 256k to fetch a 2k block.  If that block is the
first in a series, you're golden: the next 127 reads will be free
(no disk I/O).  On the other hand, if it's an isolated random read,
we just did 128 times more I/O than was actually useful.
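
The arithmetic above can be sketched as a small calculation.  The function
name and parameters are illustrative only, and the model assumes the block
is small enough that each data disk's share fits within a single 64k cache
fill.

```python
SECTOR = 512             # bytes per disk sector
CACHE_FILL = 64 * 1024   # bytes read per cache-fill read (64k, per the text)


def raidz_read_amplification(block_size, disks, parity=1,
                             sector=SECTOR, fill=CACHE_FILL):
    """Return (reads_issued, bytes_read, amplification) for one isolated read.

    Hypothetical helper: models only the small-block case described above,
    where the block is striped in whole sectors across the data disks.
    """
    data_disks = disks - parity
    sectors = -(-block_size // sector)        # ceiling division
    reads_issued = min(sectors, data_disks)   # one read per data disk touched
    bytes_read = reads_issued * fill          # each read becomes a full cache fill
    return reads_issued, bytes_read, bytes_read / block_size


# The case from the description: a 2k block on a 5-disk raidz vdev.
reads, total, factor = raidz_read_amplification(2 * 1024, disks=5)
print(reads, total, factor)
```

For the 2k block on five disks this gives 4 reads, 262144 bytes (256k) read
from disk, and an amplification factor of 128, matching the figures above.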

This is a rather extreme case, but it's real.  I'm hoping that by
making the higher-level prefetch logic in ZFS a little smarter,
we can eliminate the need for vdev-level caching altogether.
If not, we'll need to make the vdev cache policy smarter.

FWIW, my take is that the vdev_cache 64K prefetch causes a largish
performance issue when the I/O becomes saturated for a sustained period.
Mark's work on 6429205 could allow marking a pool as I/O saturated or not,
and this could be used as a hint to control the size of vdev-level
prefetching.

vdev prefetching really hurts when doing random I/O (e.g., for databases).
One easy thing we could do to improve performance is to not do vdev
prefetching if the read is on a filesystem with recordsize set.  This would
be justifiable because they should only be setting recordsize if they
expect random I/O, where prefetch of any kind is not likely to help.

Work Around
To disable the vdev cache, set 'zfs_vdev_cache_max' to 1.

To make it a permanent change, add this to /etc/system:
set zfs:zfs_vdev_cache_max = 0x1
Comments
N/A