Description
xxxxx@xxxxx.com 2001-11-16
The SunCluster failfast implementation in the kernel has three
closely related variables:
    conf.failfast_grace_time
    conf.failfast_panic_delay
    conf.failfast_panic_proxy_delay
Their values are 10, 30, and 5 seconds, respectively.
The comments in the implementation source code explaining
how these interact, and how the values were chosen, need
to be improved. I'm most interested in seeing comments
explaining the purpose of failfast_grace_time and
failfast_panic_proxy_delay.
This bug report should not be taken as implying that
these concepts are not necessary, nor that the values
are not good choices. If I had to guess, I would say
that the purpose of some of the extra delay (failfast_grace_time
and failfast_panic_proxy_delay) is to allow some time
for syslog etc. to log the messages before we actually panic.
This is just a guess on my part; there may be other reasons.
The net effect of the interaction of these values seems to
be that, in practice, when a user-land daemon dies, it may
take the node up to 45 seconds to panic (apparently the sum
of all three values) rather than the 30 that
failfast_panic_delay==30 would have suggested. I'm
not implying that 45 is too long, though.
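The arithmetic behind the 45-second figure can be sketched as below.
This is a minimal illustration, not the actual implementation: the
struct layout and the helper failfast_worst_case_secs() are hypothetical,
and the sketch assumes the three delays simply add up, which is exactly
the kind of thing the requested source comments should confirm or correct.

```c
/*
 * Hypothetical sketch, not the real Sun Cluster source.
 * The three values are the ones quoted in this report; the struct
 * layout, the helper name, and the assumption that the delays are
 * purely additive are illustrative only.
 */
struct failfast_conf {
	int failfast_grace_time;        /* seconds; purpose not documented */
	int failfast_panic_delay;       /* seconds */
	int failfast_panic_proxy_delay; /* seconds; purpose not documented */
};

static const struct failfast_conf conf = { 10, 30, 5 };

/*
 * Upper bound, in seconds, on the time from daemon death to node
 * panic, ASSUMING the three delays are applied one after another.
 */
static int
failfast_worst_case_secs(const struct failfast_conf *c)
{
	return c->failfast_grace_time + c->failfast_panic_delay +
	    c->failfast_panic_proxy_delay;
}
```

With the values above, the helper returns 45, against the 30 that
failfast_panic_delay alone would suggest.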
This bugid is a request for comments in the source code;
I'm not suggesting that we explain this to customers/SEs at
this stage of our lives.