|
Description
|
The field has notified HAS after installing and running with the binaries
suggested in escalation (537799) the domain hung, was STOP-A'd, and a core file
was produced. CPRE engineer xxxxx@xxxxx.COM suggested that we open
a new bug to track this issue.
The new core file can be located at:
/net/cores.east/cores/dir4/63037713/gw_gdl_hang2.core.tar.Z
From the core file:
files: /net/cores.east/cores/dir4/63037713/*.1
release: 5.6
version: Generic_105181-31
machine: sun4u
node name: gw_gdl
hw_provider: Sun_Microsystems
system type: SUNW,Ultra-Enterprise-10000
hostid: 0x80a6662c
snooping: 0x1
boothowto: 0x40 (DEBUG)
time of crash: Thu Jun 6 11:31:03 EDT 2002
age of system: 5 days 18 hours 6 minutes 35.69 seconds
panic string: zero
==== system appears to have been Stop-A'ed ====
==== reporting thread summary ====
220 threads ran in the last second
631 threads ran in the last minute
32 runnable threads
0 zombied threads
4 stopped threads
178 free threads
0 mutexes pending
0 rwlocks pending
1779 condition variables pending
1 semaphores pending
40 user level sobjs pending
33 shuttle sobjs pending
1 threads in biowait
2124 total threads
No real memory pressure:
==== reporting memory and swap stats ====
freemem 291573 pages (2.22G)
avefree 291497 pages (2.22G)
avefree30 292025 pages (2.22G)
availrmem (nonswapable) 2536276 pages (19.3G)
availrmem_initial 2578847 pages (19.6G)
swapfs_minfree 325083 pages (2.48G)
spt_used 0 pages (0)
kernel heap size 328089600 bytes (312M) (8.1% of max (3.75G))
kernel page table size 165046464 bytes (157M)
kernel smap table size 1048576 bytes (1M)
priority paging enabled
dyncachefree 80588 pages (629M)
cachefree 80588 pages (629M)
lotsfree 40294 pages (314M)
desfree 20147 pages (157M)
minfree 10073 pages (78.6M)
throttlefree 10073 pages (78.6M)
physmem 2600666 pages (19.8G)
total_pages 2578847 pages (19.6G)
After talking to the field team, they report that they were unable to ping
the domain, and they were unable to get a response at the domain console
via netcon. At which point they were able to send a break sequence, and typed
in sync. At this point the sync stopped so from the ssp they issued a sigbcmd
-p 6 and a core file was generated. The field also reports that
after this hang, they installed the new binaries suggested in the escalation
(6/4/2002) and 1 day later the domain "hung" again. This time they were unable
to capture a core file (due to lack of disk space).
comments from CPRE regarding findings from core file.
I think I made a mistake in taking out the thread which clock is trying to
wakeup. The correct thread is in %i2 of sleepq_wakeone_chan(). I took
the first thread from sleepq. I had that in mind because I was able to
see cv_waiters as 0x1, but according my old theory , it should have been zero.
According to new theory, cv_waiters being 0x1 is correct because there is one
more tid waiting on this CV. Traversing the t_link of sq_queue.sq_first of
0x104a0b04 will tell you what is right.
So, clock is trying to put tid (040043e80) which is part of callout thread pool.
> 40043e80::findstack
stack pointer for thread 40043e80: 40043c20
[ 40043c20 cv_wait+0x40() ]
40043c80 callout_thread+0x38()
40043ce0 thread_start+4()
> 40043e80::print kthread_t ! grep link
t_link = 0
> 40043e80::print kthread_t ! grep chan
t_wchan0 = 0
t_wchan = 0
> 40043e80::print kthread_t ! grep t_lockp
t_lockp = sleepq_head+0xbc4
> 40043e80::print kthread_t ! grep ops
t_sobj_ops = 0
t_copyops = default_copyops
>
So, clock does appear to be stuck in setbackdq() because
sleepq_wakeone_chan+0x50 takes us to setbackdq().
So, the next question is how long clock is spinning on the CPU 56 in
cv_signal() -> sleepq_wakeone_chan() -> setbackdq(). Infact where clock
has been spinning in setbackdq().
kadb would be able to tell us more on this. Infact you can try doing "$r"
on kadb prompt to dump the registers apart from "$C".
Ideally, this hang is different from dispq panic which this esc (537799) is
addressing. I would suggest ES to raise a new escalation for this hang.
It is also quite possible for clock to content on these locks, but make
progress. We do see that last time clock thread was switch out on CPU 56 is
14 seconds back. It means that clock hasn't switch'ed afterwards causing
machine to hang. As I showed paths where clock can content on mutex/dispactcher
locks in the previous update, we need to find why clock is behind. If next
crash dump doesn't help us, then we might go for TRAPTRACE kernel or
instrumented kernel to see what is happening on that CPU. We would most likely
see level-14 interrupt firing and clock spinning in some routines. With the
help of TRAPTRACE kernel we would be able to see where level-14 interrupt
is pinning the clock thread and finally where clock thread is spinning.
|