OpenSolaris

Printable Version Enter a New Search
Bug ID 4699850
Synopsis t_state not reaching global visibility in door_return() before t_lockp
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:other
Keywords door_return | sunalert-submit-availability | thread_onproc
Responsible Engineer Saurabh Mishra
Reported Against 2.6 , 5.6 , 5.7
Duplicate Of
Introduced In solaris_2.5
Commit to Fix s10_16
Fixed In s10_16
Release Fixed solaris_10(s10_16) , solaris_9u3(s9u3_03) (Bug ID:2055459)
Related Bugs 4545596 , 4712576 , 5017148
Submit Date 10-June-2002
Last Update Date 24-November-2005
Description
The field has notified HAS after installing and running with the binaries
suggested in escalation (537799) the domain hung, was STOP-A'd, and a core file
was produced.   CPRE engineer  xxxxx@xxxxx.COM suggested that we open 
a new bug to track this issue.

The new core file can be located at:

/net/cores.east/cores/dir4/63037713/gw_gdl_hang2.core.tar.Z


From the core file:

files:          /net/cores.east/cores/dir4/63037713/*.1
release:        5.6
version:        Generic_105181-31
machine:        sun4u
node name:      gw_gdl
hw_provider:    Sun_Microsystems
system type:    SUNW,Ultra-Enterprise-10000
hostid:         0x80a6662c
snooping:       0x1
boothowto:      0x40 (DEBUG)
time of crash:  Thu Jun  6 11:31:03 EDT 2002
age of system:  5 days 18 hours 6 minutes 35.69 seconds
panic string:   zero


==== system appears to have been Stop-A'ed ====


==== reporting thread summary ====

        220     threads ran in the last second
        631     threads ran in the last minute
        32      runnable threads
        0       zombied threads
        4       stopped threads
        178     free threads
        0       mutexes pending
        0       rwlocks pending
        1779    condition variables pending
        1       semaphores pending
        40      user level sobjs pending
        33      shuttle sobjs pending
        1       threads in biowait
        2124    total threads
        
No real memory pressure:

==== reporting memory and swap stats ====

freemem                      291573 pages (2.22G)
avefree                      291497 pages (2.22G)
avefree30                    292025 pages (2.22G)
availrmem (nonswapable)     2536276 pages (19.3G)
availrmem_initial           2578847 pages (19.6G)
swapfs_minfree               325083 pages (2.48G)
spt_used                          0 pages (0)
kernel heap size          328089600 bytes (312M) (8.1% of max (3.75G))
kernel page table size    165046464 bytes (157M)
kernel smap table size      1048576 bytes (1M)

priority paging enabled
dyncachefree                  80588 pages (629M)
cachefree                     80588 pages (629M)
lotsfree                      40294 pages (314M)
desfree                       20147 pages (157M)
minfree                       10073 pages (78.6M)
throttlefree                  10073 pages (78.6M)
physmem                     2600666 pages (19.8G)
total_pages                 2578847 pages (19.6G)


After talking to the field team, they report that they were unable to ping
the domain, and they were unable to get a response at the domain console
via netcon.  At which point they were able to send a break sequence, and typed
in sync.  At this point the sync stopped so from the ssp they issued a sigbcmd
-p 6 and a core file was generated.   The field also reports that
after this hang, they installed the new binaries suggested in the escalation 
(6/4/2002) and 1 day later the domain "hung" again.  This time they were unable 
to capture a core file (due to lack of disk space). 


comments from CPRE regarding findings from core file.

I think I made a mistake in taking out the thread which clock is trying to
wakeup. The correct thread is in %i2 of sleepq_wakeone_chan(). I took
the first thread from sleepq. I had that in mind because I was able to
see cv_waiters as 0x1, but according my old theory , it should have been zero.

According to new theory, cv_waiters being 0x1 is correct because there is one
more tid waiting on this CV. Traversing the t_link of sq_queue.sq_first of
0x104a0b04 will tell you what is right.

So, clock is trying to put tid (040043e80) which is part of callout thread pool.

> 40043e80::findstack
stack pointer for thread 40043e80: 40043c20
[ 40043c20 cv_wait+0x40() ]
  40043c80 callout_thread+0x38()
  40043ce0 thread_start+4()
> 40043e80::print kthread_t ! grep link
    t_link = 0
> 40043e80::print kthread_t ! grep chan
    t_wchan0 = 0
    t_wchan = 0
> 40043e80::print kthread_t ! grep t_lockp
    t_lockp = sleepq_head+0xbc4
> 40043e80::print kthread_t ! grep ops
    t_sobj_ops = 0
    t_copyops = default_copyops
> 

So, clock does appear to be stuck in setbackdq() because
sleepq_wakeone_chan+0x50 takes us to setbackdq().

So, the next question is how long clock  is spinning on the CPU 56 in
cv_signal() -> sleepq_wakeone_chan() -> setbackdq(). Infact where clock
has been spinning in setbackdq().

kadb would be able to tell us more on this. Infact you can try doing "$r" 
on kadb prompt to dump the registers apart from "$C".


Ideally, this hang is different from dispq panic which this esc (537799) is
addressing. I would suggest ES to raise a new escalation for this hang.

It is also quite possible for clock to content on these locks, but make
progress. We do see that last time clock thread was switch out on CPU 56 is 
14 seconds back. It means that clock hasn't switch'ed afterwards causing 
machine to hang. As I showed paths where clock can content on mutex/dispactcher
locks in the previous update, we need to find why clock is behind. If next
crash dump doesn't help us, then we might go for TRAPTRACE kernel or 
instrumented kernel to see what is happening on that CPU. We would most likely
see level-14 interrupt firing and clock spinning in some routines. With the
help of TRAPTRACE kernel we would be able to see where level-14 interrupt
is pinning the clock thread and finally where clock thread is spinning.



Work Around
N/A
Comments
N/A