Description
The HSFS implementation currently in Solaris performs poorly: it
issues lots of small reads and does no read-ahead. There is a lot
of scope to improve performance. The max throughput that can be
achieved today, even with the fastest data DVDs and a straight
sequential read, is a meagre 3.4 MB/s.
In addition, the changes discussed below are vital for building
Live bootable CDs and DVDs, as they reduce the bootup time of
Solaris when booting from CD/DVD, so an OpenSolaris LiveCD becomes
practical. With Solaris moving towards a LiveCD-based installer,
this is needed.
HSFS Performance Enhancements
-----------------------------
The HSFS filesystem module in Solaris performs somewhat slower
than the competition, such as Linux. CD performance is especially
important not only for multimedia apps but also for Live bootable
CDs, which is the direction being taken with the Indiana project.
With these requirements in mind, two enhancements were
made to the HSFS implementation in OpenSolaris:
1) Addition of an I/O Scheduler
2) Read-Ahead
The hsfs implementation in Solaris suffers from quite a bit of
drive head seeking and from numerous read requests being issued
in small 2K chunks. Also, no read-ahead is performed. The above
enhancements aim to reduce these problems.
The I/O scheduler implements an Elevator algorithm. The read
requests are sorted by logical block number and
issued to the device in that order. In addition, the scheduler
checks whether subsequent requests are adjacent to each
other. If so, it merges those requests into a single
larger request, which is then delivered to the drive. The
algorithm scans I/O requests in ascending order of
logical block number until no higher-numbered requests
remain, at which point it starts again from the lowest-numbered
read request, if any. This is thus a 1-way forward-merge
Elevator, also known as Circular Look. This algorithm is
suitable for CD/DVD media, as these media have a single outward
spiralling track with poor random-seek behavior.
In addition, the I/O scheduler checks for deadline
expiration to prevent starvation. A deadline of 500ms
is used for reads.
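The selection order described above can be sketched in user-space C.
This is only a minimal model under stated assumptions, not the hsfs
code: a sorted array stands in for the AVL tree, and struct req,
pick_next and the millisecond timestamps are hypothetical names
introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request record; the real scheduler uses struct hio. */
struct req {
    unsigned long lbn;   /* starting logical block number */
    long issued_ms;      /* time at which the request was enqueued */
};

#define DEADLINE_MS 500  /* read deadline quoted in the text */

/*
 * Return the index of the next request to service from reqs[0..n-1],
 * which must be sorted by ascending lbn. last_lbn is the block number
 * of the previously serviced request. Returns -1 on an empty queue.
 */
static int
pick_next(const struct req *reqs, size_t n, unsigned long last_lbn,
    long now_ms)
{
    size_t i, oldest = 0;

    if (n == 0)
        return (-1);

    /* Starvation check: find the oldest request and see if it expired. */
    for (i = 1; i < n; i++) {
        if (reqs[i].issued_ms < reqs[oldest].issued_ms)
            oldest = i;
    }
    if (now_ms - reqs[oldest].issued_ms > DEADLINE_MS)
        return ((int)oldest);

    /* Circular Look: next higher block number, else wrap to the lowest. */
    for (i = 0; i < n; i++) {
        if (reqs[i].lbn > last_lbn)
            return ((int)i);
    }
    return (0);
}
```

With requests at blocks 10, 20 and 30 and the last serviced block at
20, this picks block 30; once 30 has been serviced it wraps around to
block 10. A request older than the deadline is picked regardless of
its position.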
The read-ahead attempts to detect a sequential file
access pattern and "warms" the page cache by preloading pages
of data before the application requests them. This is
beneficial as most scenarios of CD/DVD usage show typical
single-threaded sequential access behavior. The sequential
pattern detection is not perfect, however, and can be ineffective
when multiple threads are reading the same file. But as
mentioned already, this is not a big issue for general CD/DVD
usage. Also, so as not to swamp the page cache, the UFS
freebehind logic is used here, see:
http://monaco.sfbay/detail.jsf?cr=6207772
Sequential access is assumed if there are more than 2 read
requests in ascending order and adjacent to each other. The
read-ahead logic then slowly ramps up, faulting in additional
pages. The number of pages preloaded is increased with every
subsequent sequential read, up to a maximum of 4. In addition,
read-ahead is performed only if a successful cache hit of an
earlier faulted page occurs. This behavior tends to
throttle read-ahead in case of cache misses, since those
indicate memory pressure and caching inefficiency, in which case
preloading pages would waste bandwidth. Also, the read-ahead
count is decremented with every non-sequential access.
As mentioned above, read-ahead is only issued on a cache hit,
and with the additional condition that the subsequent page
is not already in the cache. So in the ideal case this behavior
means that the application never waits for disk I/O.
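The ramp-up and throttling rules above can be modelled roughly as
follows. This is a sketch under assumptions, not the code in
hsfs_getpage: struct ra_state and both functions are hypothetical,
the counter is kept in pages rather than bytes, and resetting the
streak on a non-sequential access is a guess at reasonable behavior.

```c
#include <assert.h>
#include <stdbool.h>

#define RA_MAX_PAGES 4  /* maximum read-ahead quoted in the text */

/* Hypothetical per-file state, loosely mirroring the hsnode counters. */
struct ra_state {
    unsigned long long prev_end;  /* end offset of the previous read */
    int num_contig;               /* count of adjacent reads seen */
    int ra_pages;                 /* pages to preload on the next hit */
};

/* Update the counters for a read of len bytes at offset off. */
static int
ra_update(struct ra_state *st, unsigned long long off,
    unsigned long long len)
{
    if (off == st->prev_end) {
        /* Adjacent to the previous read: extend the streak. */
        st->num_contig++;
        if (st->num_contig > 2 && st->ra_pages < RA_MAX_PAGES)
            st->ra_pages++;
    } else {
        /* Non-sequential access: back off (streak reset is assumed). */
        st->num_contig = 0;
        if (st->ra_pages > 0)
            st->ra_pages--;
    }
    st->prev_end = off + len;
    return (st->ra_pages);
}

/* Read-ahead is issued only on a cache hit with the next page absent. */
static bool
ra_should_issue(const struct ra_state *st, bool cache_hit,
    bool next_page_cached)
{
    return (st->ra_pages > 0 && cache_hit && !next_page_cached);
}
```

In this model the third adjacent 4K read starts read-ahead at 1 page,
the count reaches 4 after three more sequential reads, and a single
random read drops it back to 3.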
Another side benefit of doing read-ahead in sequential access
is that it gives the I/O scheduler enough requests to work with, so
it can optimize and coalesce the subsequent reads, achieving
higher throughput. An application reading one or two pages at
a time does not exercise the I/O scheduler much.
Code Overview
-------------
usr/src/uts/common/fs/hsfs/hsfs_vfsops.c:
The modifications in this file deal with initializing the
required data structures during mount. A global variable is
checked to determine whether to enable these features or
not. This is for debugging purposes:
int do_schedio = 1;
The variable can be toggled via mdb prior to mounting to
disable these features.
static int
hs_mountfs(
...
...
    if (do_schedio) {
        fsp->hqueue = kmem_alloc(sizeof(struct hsfs_queue), KM_SLEEP);
        hsched_init(fsp, fsid, &modlinkage);
    }
...
}
Cleanup is done via a call to hsched_fini in hsfs_unmount.
usr/src/uts/common/sys/fs/hsfs_node.h
The read-ahead and I/O scheduler data structures are defined
in this file. Read-ahead computation adds 3 new variables to
the hsnode structure:
u_offset_t hs_prev_offset; /* Last read end offset (readahead) */
int hs_num_contig; /* Count of contiguous reads */
int hs_ra_bytes; /* Bytes to readahead */
The other data structures are:
struct hio - A structure that holds information for a read
request that is enqueued for processing by the
scheduling function. An AVL tree is used to
access the read requests in a sorted manner.
struct hio_info - A structure that holds information about
all the read requests issued during a read-ahead
invocation. This is then enqueued on a task-
queue for processing by a thread that takes
this read-ahead to completion and cleans up.
struct hsfs_queue - This is per-filesystem structure that
stores toplevel data structures for the I/O
scheduler.
The hsfs filesystem structure is modified to contain
a pointer to a struct hsfs_queue.
usr/src/uts/common/fs/hsfs/hsfs_node.c
Very simple changes to initialize the read-ahead counters
when initializing a hsnode.
usr/src/uts/common/fs/hsfs/hsfs_vnops.c
This file contains 90% of the changes. Most of it is new code,
along with changes to the hsfs_read, hsfs_getpage and
hsfs_getapage routines.
The changes to hsfs_read deal with doing the freebehind
correctly if read-ahead is in effect. This is no different
from the corresponding implementation in UFS.
The changes to hsfs_getpage deal with updating the read-ahead
counters based on the vnode, the offset and length of the data being
read, and the end offset of the previous read on the
same file. The code is fairly well commented.
The changes to hsfs_getapage deal with creating the struct
hio requests and enqueuing them for processing by the I/O
scheduler. It also checks for read-ahead and invokes the
read-ahead routine if several conditions are met. These
conditions were mentioned towards the beginning of this
document (5th para).
It is pertinent to note here that the I/O scheduling function
hsched_invoke_strategy does not run in a separate thread.
Instead the caller, hsfs_getapage, is expected to repeatedly
invoke this function until its I/O requests have been
satisfied. In practice this has been found to give low-overhead,
high performance. Since hsched_invoke_strategy acquires
the strategy_lock on entry to ensure single-threaded operation,
all threads except one will be sleeping on this mutex rather
than busy-waiting, as the text above might seem to imply.
All of the new code in hsfs_getapage is commented and not
too difficult to follow. At all points a check is made to
see whether these features are to be used - for debuggability.
This code calls the read-ahead function if we have a cache
hit, we are doing sequential read and the next page is not
in the cache:
if (fsp->hqueue != NULL &&
    hp->hs_prev_offset - off == pgsize &&
    hp->hs_prev_offset < filsiz &&
    hp->hs_ra_bytes > 0 &&
    !page_exists(vp, hp->hs_prev_offset)) {
        hsfs_getpage_ra(vp, hp->hs_prev_offset, seg,
            addr + pgsize, hp, fsp, xarsiz, bof,
            chunk_lbn_count, chunk_data_bytes);
}
hsfs_getpage_ra is essentially a simplified version of
hsfs_getapage. It does most of the same processing but
puts the read requests on a queue for processing via a
background kernel thread:
bufsused = count;
info = kmem_alloc(sizeof (struct hio_info), KM_SLEEP);
info->bufs = bufs;
info->vas = vas;
info->sema = fio_done;
info->bufsused = bufsused;
info->bufcnt = bufcnt;
info->hqueue = fsp->hqueue;
info->pp = pp;
(void) taskq_dispatch(fsp->hqueue->ra_task,
hsfs_ra_task, info, KM_SLEEP);
hsfs_ra_task runs when the ra_task queue has been fed
some requests. It in turn invokes the scheduling function
until its requests are serviced. It then cleans up
and releases the I/O lock on the pages. The ra_task
queue is a dynamic task queue since it is performance sensitive.
To be fully effective, the read-ahead should complete before,
or just before, the application comes back with the request
for that page. For a purely sequential single-threaded read
from DVD, a drop in throughput was observed in practice when
using a non-dynamic task queue.
hsched_invoke_strategy contains the real meat of the I/O
scheduler. First it grabs its own lock and then it
grabs the lock that protects the AVL trees. It then checks
the deadline tree to see whether the oldest request has
exceeded its deadline. If so, that request is used
as the starting point.
Otherwise it looks at the read tree sorted ascending by
LBN and fetches the request with the next higher block
number from the read request that was processed earlier.
If there are no such requests in the queue then it fetches
the one with the lowest logical block number. This is what
gives the Circular Look behavior. This is the code segment
responsible for that:
fio = avl_find(&hqueue->read_tree, hqueue->next, &pos);
if (fio != NULL)
    fio = AVL_NEXT(&hqueue->read_tree, fio);
else
    fio = avl_nearest(&hqueue->read_tree, pos, AVL_AFTER);
if (fio == NULL) {
    fio = avl_first(&hqueue->read_tree);
}
Here hqueue->next is a dummy struct hio that holds the
logical block number of the last processed read request.
avl_find will either return a node having the given
value or if it does not exist will return NULL and pos
will point to the insertion point. Both cases are handled.
Subsequently the code does a forward (ascending block number)
coalescing of buffers that are adjacent to each other. The
AVL tree is traversed in order via AVL_NEXT and all the
adjacent buffers are put into a linked list.
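The forward coalescing pass can be illustrated with a small user-space
sketch. It is not the code in hsched_invoke_strategy: a sorted array
replaces the in-order AVL traversal, struct blk_req and coalesce_run
are hypothetical names, and capping the run at a maximum transfer size
is an assumption about how the probed device limit would be applied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request descriptor, in units of logical blocks. */
struct blk_req {
    unsigned long lbn;    /* first logical block */
    unsigned long nblks;  /* length in logical blocks */
};

/*
 * Starting at index start in reqs[] (sorted ascending by lbn), count
 * how many consecutive requests form one contiguous run whose total
 * length stays within max_blks. The run length in blocks is stored in
 * *total_blks. At least one request is always returned.
 */
static size_t
coalesce_run(const struct blk_req *reqs, size_t n, size_t start,
    unsigned long max_blks, unsigned long *total_blks)
{
    size_t i;
    unsigned long total;

    assert(start < n);
    total = reqs[start].nblks;
    for (i = start + 1; i < n; i++) {
        const struct blk_req *prev = &reqs[i - 1];

        if (prev->lbn + prev->nblks != reqs[i].lbn)
            break;  /* gap: the next request is not adjacent */
        if (total + reqs[i].nblks > max_blks)
            break;  /* would exceed the assumed transfer limit */
        total += reqs[i].nblks;
    }
    *total_blks = total;
    return (i - start);
}
```

For requests at blocks 100, 102, 104 and 200, each 2 blocks long, the
first three merge into one 6-block read and the request at 200 is left
for a later pass.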
Next if adjacent buffers were detected then a new buf
structure is synthesized. This is somewhat different from
a normal buf that one would get via getrbuf or bioclone.
In particular the buf points to a kmem_alloc-ed chunk.
Also the buf structure itself is allocated once during
mount and re-used every time through the scheduling
function as it is single-threaded.
Finally this buf is dispatched and the function waits for
the I/O to complete. Once the data has been received successfully,
the blocks are copied back into the original bufs
that were sent down by the caller, and biodone is
signaled for each.
Errors are handled by looking at the b_resid buf member.
b_resid indicates how much data was not processed
for the I/O, so we can find out which of the caller's
original bufs are good to go and which failed, and signal
appropriately.
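The b_resid bookkeeping amounts to simple arithmetic, sketched below.
This assumes the untransferred bytes sit at the tail of the coalesced
buffer; count_completed is a hypothetical helper, not a function in
the hsfs module.

```c
#include <assert.h>
#include <stddef.h>

/*
 * sizes[i] is the byte count of the i-th original request in the
 * coalesced run and resid is the number of bytes the device did not
 * transfer. Returns how many leading requests were fully satisfied;
 * the remaining ones would be completed with an error.
 */
static size_t
count_completed(const size_t *sizes, size_t n, size_t resid)
{
    size_t total = 0, done, acc = 0, i;

    for (i = 0; i < n; i++)
        total += sizes[i];
    assert(resid <= total);
    done = total - resid;  /* bytes actually transferred */

    for (i = 0; i < n; i++) {
        if (acc + sizes[i] > done)
            break;  /* this request was cut short */
        acc += sizes[i];
    }
    return (i);
}
```

For a run of 4K, 4K and 8K requests, a b_resid of 8192 means the first
two requests succeeded and the third failed.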
Initialization: hsched_init performs the initialization
and is called during mount. It sets up the mutexes, the
AVL trees and the read-ahead task queue. The maximum I/O
transfer size supported by the device is probed using
ldi_ioctl; if the ldi_ioctl is not successful, a conservative
default of 16K is assumed.
The read-ahead size is also set here. Ordinarily we'd
read ahead 4 pages' worth of data, but this is reduced to
1 page when large pages are in use.
The function hsched_fini does all the cleanup.
hsfs_deadline_compare and hsfs_offset_compare are the
comparison functions used for the AVL trees. These look
and behave suspiciously like the comparison functions in:
usr/src/uts/common/fs/zfs/vdev_queue.c
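For readers unfamiliar with AVL comparators, the general shape is
sketched below. This is an illustration only, not the hsfs or ZFS
source: struct io_req and its fields are hypothetical. The point it
shows is that a comparator for a tree of in-flight requests returns
-1, 0 or 1 and breaks ties on the node address, so that two distinct
requests never compare equal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request node; field names are for illustration only. */
struct io_req {
    unsigned long lbn;  /* starting logical block number */
    long deadline_ms;   /* absolute deadline of the request */
};

/* Order by block number, then by address to keep nodes distinct. */
static int
offset_compare(const void *a, const void *b)
{
    const struct io_req *x = a, *y = b;

    if (x->lbn < y->lbn)
        return (-1);
    if (x->lbn > y->lbn)
        return (1);
    if ((uintptr_t)x < (uintptr_t)y)
        return (-1);
    if ((uintptr_t)x > (uintptr_t)y)
        return (1);
    return (0);
}

/* Order by deadline first, falling back to the offset ordering. */
static int
deadline_compare(const void *a, const void *b)
{
    const struct io_req *x = a, *y = b;

    if (x->deadline_ms < y->deadline_ms)
        return (-1);
    if (x->deadline_ms > y->deadline_ms)
        return (1);
    return (offset_compare(a, b));
}
```

With this ordering, the first node of the deadline tree is always the
request closest to expiry, and the offset tree yields requests in
ascending block order.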
I spent some time testing this and used filebench as well. Interestingly, filebench (and probably even iozone) is written with writable filesystems in mind, so I ran into problems using it on hsfs, which is read-only, and had to make changes to filebench to get it to work. For example, it opens files with O_RDWR even for the read tests. The workload scripts also needed a change to use predefined files instead of creating new ones.
I used the random multi-thread read, single-stream sequential read and multi-stream sequential read tests with various chunk sizes.
Using filebench gave me some more insights and helped improve the code/performance. Here are the changes I made since last time:
- The logic used to implement the 1-way Elevator (Circular Look) was too expensive.
It involved two traversals of the AVL tree while holding a lock. Here's what I
was doing earlier:
    fio = avl_find(&hqueue->read_tree, hqueue->next, &pos);
    if (fio != NULL)
        fio = AVL_NEXT(&hqueue->read_tree, fio);
    else
        fio = avl_nearest(&hqueue->read_tree, pos,
            AVL_AFTER);
The combination of avl_find and avl_nearest was too expensive. After a bunch of
meddling I hit upon a way to use the last processed I/O node of the current
invocation as a sentinel for the next invocation.
That way the code boils down to just a simple AVL_NEXT:
fio = AVL_NEXT(&hqueue->read_tree, hqueue->next);
avl_remove(&hqueue->read_tree, hqueue->next);
This made a difference.
- filebench showed a degradation in performance for small 2k-4k random I/O by
multiple threads, whereas bigger chunks of I/O showed a big benefit.
That essentially boils down to the non-coalescing case: the last else case in
hsched_invoke_strategy. That would simply issue a bdev_strategy and biowait and then
release the I/O lock. That was not enough to keep the I/O pipe and the device
sufficiently busy. So I changed it to release the lock before calling biowait, as
at that point there is no shared data to worry about. This change actually
improved the small random read performance compared to the vanilla hsfs, and the
benefits of re-ordering were visible.
- The normal hsfs module always issues reads in 2K chunks. This was a result of the
need to support file data interleaving on older hardware. However, from what I see,
interleaving is hardly used today. HSFS computes the interleaving chunk size and
sets it to the HSFS logical block size of 2K when there is no interleaving. This
is wasteful and results in 2K reads even when the file data is contiguous and we
can read whole pages at a time.
Thus I made a small tweak to set the chunk size to the page size if the
page size is a multiple of the logical block size. This resulted in much lower
processing overhead, and the I/O scheduler is better able to coalesce:
if (hp->hs_dirent.intlf_sz == 0) {
    chunk_data_bytes = LBN_TO_BYTE(1, vp->v_vfsp);
    /*
     * Optimization: If our pagesize is a multiple of LBN
     * bytes, we can avoid breaking up a page into individual
     * lbn-sized requests.
     */
    if (pgsize % chunk_data_bytes == 0) {
        chunk_lbn_count = BYTE_TO_LBN(pgsize, vp->v_vfsp);
        chunk_data_bytes = pgsize;
    }
...
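The effect of the chunk size tweak on request counts is easy to work
out. The sketch below is plain arithmetic with hypothetical helper
names; it ignores coalescing and read-ahead, which reduce the count
further.

```c
#include <assert.h>

/*
 * Chunk size chosen for a non-interleaved file: the page size when it
 * is a multiple of the logical block size, otherwise the block size.
 */
static unsigned long
chunk_bytes(unsigned long pgsize, unsigned long lbn_bytes)
{
    assert(lbn_bytes > 0);
    return ((pgsize % lbn_bytes == 0) ? pgsize : lbn_bytes);
}

/* Number of chunk-sized requests needed to read file_bytes. */
static unsigned long long
requests_needed(unsigned long long file_bytes, unsigned long chunk)
{
    assert(chunk > 0);
    return ((file_bytes + chunk - 1) / chunk);
}
```

Reading a 1GB file in 2K chunks takes 524288 requests, which matches
the 2048-byte bucket count in the baseline DTrace distribution in the
test results. With 4K pages the same file needs 262144 chunk-sized
requests before any coalescing takes place.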
##################################################
# Test Results
##################################################
A bunch of testing was performed on a Thinkpad T60p laptop and the results are posted below. Testing on SPARC is discussed in another note. Filebench was used to collect metrics on a number of testcases. Filebench had to be modified slightly to make it work with a read-only filesystem.
Since these are performance tests, the baseline metrics and the metrics from the enhanced module are compared below. A test DVD with three 1GB files was used. An SXDE B70 DVD was also used in the tests.
It will be clear from the results below that there is a general improvement in performance, sometimes up to a 50% reduction in the time taken. The reduction in system time is also large. The biggest reason for this is shown by the DTrace outputs at the end. The current hsfs module always issues reads of 2K size, which results in a huge number of requests going down to the device. The enhanced module, on the other hand, never issues 2K requests. Its requests are 4K or larger, resulting in fewer physical reads and a reduction in system load.
The iostat outputs at the end clearly show increased throughput. Since this was a laptop, the DVD drive takes more time to reach maximum throughput. On a desktop, in earlier use, the throughput was observed to approach the device maximum when copying large files.
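The percentage reductions can be checked directly from the time(1)
figures reported below. The helpers here are hypothetical and only do
the arithmetic.

```c
#include <assert.h>

/* Convert a time(1) style "XmY.ZZZs" reading to seconds. */
static double
to_seconds(int minutes, double seconds)
{
    return (minutes * 60.0 + seconds);
}

/* Percentage reduction going from the baseline to the enhanced time. */
static double
pct_reduction(double baseline, double enhanced)
{
    assert(baseline > 0.0);
    return ((baseline - enhanced) / baseline * 100.0);
}
```

For the straight file copy, pct_reduction(to_seconds(6, 3.944),
to_seconds(3, 6.583)) comes to roughly 48.7%, and the dd system time
going from 12.340s to 3.033s is a reduction of roughly 75%.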
------------------
Straight file copy
------------------
## BASELINE ##
bash-3.00# time cp /mnt/cdrom/largefile1 /space/
real 6m3.944s
user 0m0.002s
sys 0m20.527s
## ENHANCED ##
bash-3.00# time cp /mnt/cdrom1/largefile1 /space/
real 3m6.583s
user 0m0.001s
sys 0m15.917s
----------------
dd of a 1GB File
----------------
## BASELINE ##
bash-3.00# time dd if=/media/TestDVD/largefile1 of=/dev/null bs=8192 count=131072
131072+0 records in
131072+0 records out
real 6m4.996s
user 0m0.228s
sys 0m12.340s
## ENHANCED ##
bash-3.00# time dd if=/media/TestDVD/largefile1 of=/dev/null bs=8192 count=131072
131072+0 records in
131072+0 records out
real 3m10.065s
user 0m0.170s
sys 0m3.033s
----------------------
cpio of a SXDE B70 DVD
----------------------
## BASELINE ##
bash-3.00# time find . | cpio -pdum /space/work/dvd
6944256 blocks
real 36m15.867s
user 0m1.885s
sys 2m46.268s
## ENHANCED ##
bash-3.00# time find . | cpio -pdum /space/work/dvd
6944256 blocks
real 28m22.953s
user 0m1.467s
sys 0m37.423s
---------------------
Tar of a SXDE B70 DVD
---------------------
## BASELINE ##
bash-3.00# time tar cpf - . | cat > /dev/null
real 86m56.636s
user 0m2.822s
sys 2m12.932s
## ENHANCED ##
bash-3.00# time tar cpf - . | cat > /dev/null
real 79m31.187s
user 0m2.438s
sys 0m30.000s
--------------------------------------------------------------------------------
Tar of a lofi mounted SXDE ISO image, the ISO image residing on a UFS filesystem
--------------------------------------------------------------------------------
## BASELINE ##
bash-3.00# time tar cpf - . | cat > /dev/null
real 3m7.076s
user 0m3.108s
sys 1m33.591s
## ENHANCED ##
bash-3.00# time tar cpf - . | cat > /dev/null
real 2m26.876s
user 0m2.728s
sys 0m27.414s
--------------------------------------------------------------------------------
Tar of a lofi mounted SXDE ISO image, the ISO image residing on a ZFS filesystem
--------------------------------------------------------------------------------
## BASELINE ##
bash-3.00# time tar cpf - . | cat > /dev/null
real 2m38.084s
user 0m3.162s
sys 1m34.831s
## ENHANCED ##
bash-3.00# time tar cpf - . | cat > /dev/null
real 1m35.409s
user 0m2.616s
sys 0m26.476s
------------------------------------------------------------------------
Number of physical I/O requests issued by hsfs from the dd of a 1GB file
------------------------------------------------------------------------
## BASELINE ##
bash-3.00# dtrace -n 'fbt::bdev_strategy:entry { @[execname] = count(); }'
dtrace: description 'fbt::bdev_strategy:entry ' matched 1 probe
^C
fsflush 4
firefox-bin 5
sched 59
dd 524286
## ENHANCED ##
bash-3.00# dtrace -n 'fbt::bdev_strategy:entry { @[execname] = count(); }'
dtrace: description 'fbt::bdev_strategy:entry ' matched 1 probe
^C
dd 13
fsflush 19
sched 65694
------------------------------------------------------
Block sizes issued from hsfs due to a dd of a 1GB file
------------------------------------------------------
## BASELINE ##
bash-3.00# dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = lquantize(args[0]->b_bcount, 0, 32767, 2048); }'
dtrace: description 'io:::start ' matched 3 probes
^C
...
...
dd /media/TestDVD/largefile3
value ------------- Distribution ------------- count
0 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 524288
4096 | 0
## ENHANCED ##
>> The Blocks accounted to sched below are actually the async read-ahead requests from hsfs
>>
bash-3.00# dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = lquantize(args[0]->b_bcount, 0, 65536, 2048); }'
dtrace: description 'io:::start ' matched 3 probes
^C
...
...
dd /media/TestDVD/largefile3
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@ 2
6144 | 0
8192 | 0
10240 | 0
12288 |@@@@ 1
14336 | 0
16384 | 0
18432 | 0
20480 |@@@@ 1
22528 | 0
24576 |@@@@@@@@@@@@@@@@@@@@@@@@@ 7
26624 | 0
sched /media/TestDVD/largefile3
value ------------- Distribution ------------- count
10240 | 0
12288 | 1
14336 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 65522
18432 | 0
-----------------------------------------
Iostat output snippet from dd of 1GB file
-----------------------------------------
## BASELINE ##
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
1515.6 0.0 3031.2 0.0 0.4 1.0 0.3 0.6 43 96 c0t0d0
cpu
us sy wt id
1 7 0 91
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 0.2 0.0 0.0 0.0 1.5 0 0 c1t0d0
1515.4 0.0 3030.8 0.0 0.4 1.0 0.3 0.6 43 96 c0t0d0
cpu
us sy wt id
1 7 0 92
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
1506.8 0.0 3013.6 0.0 0.4 1.0 0.3 0.6 43 96 c0t0d0
## ENHANCED ##
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
467.4 0.0 5608.9 0.0 0.0 0.9 0.0 2.0 0 94 c0t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t0d0
...
...
cpu
us sy wt id
2 6 0 92
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
548.4 0.0 6581.2 0.0 0.0 0.9 0.0 1.7 0 93 c0t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t0d0
cpu
us sy wt id
2 6 0 92
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c1t0d0
553.0 0.0 6636.0 0.0 0.0 0.9 0.0 1.7 0 93 c0t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c2t0d0
##################################################
# Filebench test results on x86
##################################################
12 Filebench testcases were used:
- Random read with 2K block size, 16 threads
- Random read with 8K block size, 16 threads
- Random read with 32K block size, 16 threads
- Random read with 64K block size, 16 threads
- Random read with 1M block size, 16 threads
- Single stream read with 8K blocksize
- Single stream read with 32K blocksize
- Single stream read with 1M blocksize
- Multi stream read with 8K blocksize
- Multi stream read with 32K blocksize
- Multi stream read with 1M blocksize
- Mixed read 4 threads doing a mixture of random and sequential I/O on same/different files
The test DVD used for this had three 1GB files. The results are reproduced below. It is clear that there is an across-the-board benefit, though the benefits for random read are marginal. The single-stream reads show a massive benefit due to I/O coalescing and read-ahead caching, which give the application the illusion of higher bandwidth than the device can support.
The Filebench and earlier tar tests indicate possibilities for additional improvement, namely better caching of metadata to reduce seeks, and doing read-ahead even for random reads with large chunk sizes. However, those are possibilities for another RFE.
---------------------------
randomread2k
---------------------------
## BASELINE ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 15ops/s 0.0mb/s 2149.3ms/op 29us/op-cpu
IO Summary: 4474 ops 14.8 ops/s, 15/0 r/w 0.0mb/s, 624uscpu/op
## ENHANCED ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 15ops/s 0.0mb/s 2156.4ms/op 26us/op-cpu
IO Summary: 4494 ops 14.9 ops/s, 15/0 r/w 0.0mb/s, 479uscpu/op
---------------------------
randomread8k
---------------------------
## BASELINE ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 15ops/s 0.1mb/s 2191.6ms/op 45us/op-cpu
IO Summary: 4383 ops 14.5 ops/s, 15/0 r/w 0.1mb/s, 744uscpu/op
## ENHANCED ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 15ops/s 0.1mb/s 2150.8ms/op 42us/op-cpu
IO Summary: 4508 ops 14.9 ops/s, 15/0 r/w 0.1mb/s, 1278uscpu/op
---------------------------
randomread32k
---------------------------
## BASELINE ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 3ops/s 0.1mb/s 9669.6ms/op 163us/op-cpu
IO Summary: 984 ops 3.3 ops/s, 3/0 r/w 0.1mb/s, 3096uscpu/op
## ENHANCED ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 3ops/s 0.1mb/s 9609.6ms/op 145us/op-cpu
IO Summary: 997 ops 3.3 ops/s, 3/0 r/w 0.1mb/s, 2344uscpu/op
---------------------------
randomread64k
---------------------------
## BASELINE ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 2ops/s 0.1mb/s 19598.3ms/op 310us/op-cpu
IO Summary: 484 ops 1.6 ops/s, 2/0 r/w 0.1mb/s, 6535uscpu/op
## ENHANCED ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 2ops/s 0.1mb/s 19641.2ms/op 288us/op-cpu
IO Summary: 481 ops 1.6 ops/s, 2/0 r/w 0.1mb/s, 6536uscpu/op
---------------------------
randomread1m
---------------------------
## BASELINE ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.1mb/s 326255.2ms/op 5036us/op-cpu
IO Summary: 32 ops 0.1 ops/s, 0/0 r/w 0.1mb/s, 184047uscpu/op
## ENHANCED ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.1mb/s 300453.0ms/op 4166us/op-cpu
IO Summary: 34 ops 0.1 ops/s, 0/0 r/w 0.1mb/s, 165575uscpu/op
---------------------------
singlestreamread8k
---------------------------
## BASELINE ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 366ops/s 2.9mb/s 2.7ms/op 98us/op-cpu
IO Summary: 110724 ops 366.2 ops/s, 366/0 r/w 2.9mb/s, 363uscpu/op
## ENHANCED ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 19754ops/s 154.3mb/s 0.0ms/op 17us/op-cpu
IO Summary: 5969765 ops 19754.4 ops/s, 19754/0 r/w 154.3mb/s, 18uscpu/op
---------------------------
singlestreamread32k
---------------------------
## BASELINE ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 92ops/s 2.9mb/s 10.9ms/op 370us/op-cpu
IO Summary: 27754 ops 91.8 ops/s, 92/0 r/w 2.9mb/s, 1433uscpu/op
## ENHANCED ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 7185ops/s 224.5mb/s 0.1ms/op 52us/op-cpu
IO Summary: 2172130 ops 7184.6 ops/s, 7185/0 r/w 224.5mb/s, 57uscpu/op
---------------------------
singlestreamread1m
---------------------------
## BASELINE ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 3ops/s 2.9mb/s 347.9ms/op 11671us/op-cpu
IO Summary: 867 ops 2.9 ops/s, 3/0 r/w 2.9mb/s, 45757uscpu/op
## ENHANCED ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 268ops/s 268.1mb/s 3.7ms/op 1484us/op-cpu
IO Summary: 81049 ops 268.1 ops/s, 268/0 r/w 268.1mb/s, 1633uscpu/op
---------------------------
multistreamread8k
---------------------------
## BASELINE ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 2ops/s 0.0mb/s 452.8ms/op 41us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 2ops/s 0.0mb/s 452.6ms/op 41us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 2ops/s 0.0mb/s 452.4ms/op 42us/op-cpu
IO Summary: 2001 ops 6.6 ops/s, 7/0 r/w 0.0mb/s, 1338uscpu/op
## ENHANCED ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 4ops/s 0.0mb/s 261.0ms/op 37us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 5ops/s 0.0mb/s 204.0ms/op 31us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 5ops/s 0.0mb/s 203.5ms/op 31us/op-cpu
IO Summary: 4071 ops 13.5 ops/s, 13/0 r/w 0.1mb/s, 877uscpu/op
---------------------------
multistreamread32k
---------------------------
## BASELINE ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 1ops/s 0.0mb/s 1803.5ms/op 151us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 1ops/s 0.0mb/s 1805.5ms/op 135us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 1ops/s 0.0mb/s 1804.6ms/op 147us/op-cpu
IO Summary: 501 ops 1.7 ops/s, 2/0 r/w 0.0mb/s, 5440uscpu/op
## ENHANCED ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 1ops/s 0.0mb/s 747.1ms/op 100us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 0ops/s 0.0mb/s 5424.4ms/op 754us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 1ops/s 0.0mb/s 744.7ms/op 107us/op-cpu
IO Summary: 872 ops 2.9 ops/s, 3/0 r/w 0.1mb/s, 2596uscpu/op
---------------------------
multistreamread1m
---------------------------
## BASELINE ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 0ops/s 0.0mb/s 57412.9ms/op 4432us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 0ops/s 0.0mb/s 57478.9ms/op 4466us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 0ops/s 0.0mb/s 57449.6ms/op 4272us/op-cpu
IO Summary: 15 ops 0.0 ops/s, 0/0 r/w 0.0mb/s, 151319uscpu/op
## ENHANCED ##
Flowop totals:
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread3 0ops/s 0.0mb/s 71125.5ms/op 9738us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread2 0ops/s 0.0mb/s 21493.7ms/op 3104us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 0ops/s 0.0mb/s 21687.5ms/op 3051us/op-cpu
IO Summary: 30 ops 0.1 ops/s, 0/0 r/w 0.1mb/s, 183635uscpu/op
---------------------------
mixedread
---------------------------
## BASELINE ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read2 0ops/s 0.0mb/s 77590.5ms/op 4326us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 0ops/s 0.0mb/s 77433.9ms/op 4409us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 2ops/s 0.0mb/s 609.1ms/op 43us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 4ops/s 0.0mb/s 554.5ms/op 34us/op-cpu
IO Summary: 1592 ops 5.3 ops/s, 5/0 r/w 0.0mb/s, 1606uscpu/op
## ENHANCED ##
Flowop totals:
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read2 0ops/s 0.0mb/s 94251.1ms/op 9330us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread1 0ops/s 0.0mb/s 87424.1ms/op 7679us/op-cpu
limit 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
seqread 4ops/s 0.0mb/s 224.2ms/op 32us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-rate 0ops/s 0.0mb/s 0.0ms/op 0us/op-cpu
rand-read1 3ops/s 0.0mb/s 616.5ms/op 60us/op-cpu
IO Summary: 2351 ops 7.8 ops/s, 8/0 r/w 0.1mb/s, 1669uscpu/op
##################################################
# Filebench test results on SPARC
##################################################
The same Filebench tests as before were run on a T2000 system and similar results appeared. The random access tests actually delivered better numbers compared to baseline on the T2000.
These tests were also repeated with kmem_flags = 0x1f.
Please see comments for detailed SPARC results. We have a massive bug description field and we are hitting the limit on how much text it can contain.