[07/18/96]
I was running vmstress. Some of its tests dump core, which is more or
less OK for vmstress according to its README file. But dumping core
causes a kernel stack overflow and a panic. Here's the stack
trace:
{0} ok ctrace
PC: f005ac6c
Last leaf: jmpl f005b18c from 100079a8 client_handler+38
0 w %o0-%o5: (10000000 16 f0000000 1 3 1 )
call 10007970 client_handler from 10041b00 p1275_sparc_cif_handler+20
1 w %o0-%o5: (f005b18c 104077b8 0 0 3 10407704 )
call 10041ae0 p1275_sparc_cif_handler from 1003e914 prom_enter_mon+34
2 w %o0-%o5: (104077b8 1e e 10420248 0 10421a98 )
call 1003e8e0 prom_enter_mon from 10025330 debug_enter+a0
3 w %o0-%o5: (1042b400 0 0 100074bc 513c0580 3037c860 )
call 10025290 debug_enter from 10024114 do_panic+180
4 w %o0-%o5: (0 0 10006ed0 0 10006ed4 44 )
call 10023f94 do_panic from 10023f84 panic+1c
5 w %o0-%o5: (10406a18 ffffffc0 10413668 0 ffffffff 10408c00 )
call 10023f68 panic from 100070ac sys_tl1_panic+8
6 w %o0-%o5: (104069d4 2 0 0 0 50ca0d90 )
call 10036610 splx from 10046328 disp_getwork+c8
7 w %o0-%o5: (a db 5 0 0 0 )
call 10046260 disp_getwork from 10044434 disp+b0
8 w %o0-%o5: (0 10423004 0 0 ffffffff 10421590 )
call 10044384 disp from 10044770 swtch+124
9 w %o0-%o5: (0 1041b290 1045dd68 0 ffffffff ffffffff )
call 1004464c swtch from 1006f54c genunix:cv_timedwait_sig+28c
a w %o0-%o5: (fcda4 513c0580 10421590 1045dd68 513c0580 535584e0 )
call 1006f2c0 genunix:cv_timedwait_sig from 507536b0 rpcmod:clnt_cots_kcallit+5f4
b w %o0-%o5: (5063067c 507b5b54 fd18c 10463c00 2001788c 0 )
jmpl 507530bc rpcmod:clnt_cots_kcallit from 51025db0 nfs:rfscall+384
c w %o0-%o5: (50630670 5075cc18 5075b100 fcda4 50630660 50630660 )
{0} ok 3037c288 d8 + .stacktrace
call 510258bc nfs:rfs3call from 51036f2c nfs:nfs3write_rpccall+a0
( 506a99a8 7 5103f4a8 3037c568 5103f6e0 3037c4d0 )
call 51036e8c nfs:nfs3write_rpccall from 51037028 nfs:nfs3write+90
( 506a99a8 3037c568 3037c4d0 507a37b0 52973028 506a99a8 )
call 51036f98 nfs:nfs3write from 5103b348 nfs:nfs3_bio+3f8
( 512d4af4 524fc000 0 8000 2000 507a37b0 )
call 5103af50 nfs:nfs3_bio from 51036e30 nfs:nfs3_rdwrlbn+d0
( 2000 3037c6b4 532dd3b4 510460c4 10464304 51048000 )
call 51036d60 nfs:nfs3_rdwrlbn from 5103c594 nfs:nfs3_sync_putapage+24
( 512d4af4 106fb800 0 8000 532dd398 10100 )
call 5103c570 nfs:nfs3_sync_putapage from 5103c530 nfs:nfs3_putapage+378
( 512d4af4 106fb800 0 8000 2000 10000 )
jmpl 0 from 100f3980 genunix:pvn_vplist_dirty+3ac
( 512d4af4 106fb800 0 0 10000 507a37b0 )
call 100f35d4 genunix:pvn_vplist_dirty from 51022c90 nfs:nfs_putpages+12c
( 10463e0c 10456434 0 5103c1b8 10000 507a37b0 )
call 51022b64 nfs:nfs_putpages from 5103c180 nfs:nfs3_putpage+a8
( 512d4af4 0 0 2 512d4b3c 507a37b0 )
jmpl 10015ed4 gen_clk_int from 51020288 nfs:nfs_purge_caches+98
( 512d4af4 0 0 0 10000 507a37b0 )
call 510201f0 nfs:nfs_purge_caches from 51020510 nfs:nfs_cache_check+f0
( 512d4af4 507a37b0 0 0 512d4b3c 512d4ae8 )
call 51020420 nfs:nfs_cache_check from 51021094 nfs:nfs3_getattr_otw+1a0
( 512d4af4 1 1 0 24ab720 3037cb70 )
call 51020ef4 nfs:nfs3_getattr_otw from 510201b4 nfs:nfs3_validate_caches+bc
( 0 3037cc28 507a37b0 5103ec70 5103ec18 0 )
call 510200f8 nfs:nfs3_validate_caches from 5103b608 nfs:nfs3_getpage+48
( 512d4af4 507a37b0 512d4c28 512d4ae8 512d4b3c 512d4af4 )
jmpl 10015ed4 gen_clk_int from 100a64e8 genunix:segmap_fault+160
( 512d4af4 0 0 2000 3037cddc 3037cdd0 )
jmpl 10015ed4 gen_clk_int from 100eed44 genunix:as_fault+400
( 50733fc8 50975f98 40ffc000 2000 0 2 )
call 100ee944 genunix:as_fault from 100304f8 pagefault+34
( 50975f98 50975f98 2000 0 40ffc000 40ffc000 )
call 100304c4 pagefault from 1002e328 trap+700
( 40ffc000 0 2 1 1042a7a4 51611680 )
call 1002dc28 trap from 10020690 sfmmu_tsb_miss+624
( 3037d0c8 10000 4 1000bdcc 0 0 )
???? from 100074bc prom_rtt+118
( 1042c000 0 0 50733fc8 0 10574848 )
call 1000d3ac bcopy+1528 from 1000d318 bcopy+1494
( 7 3c 7 5613 2080 0 )
call 1000bd68 kcopy from 10092984 genunix:uiomove+f0
( 1241d4b8 40ffc638 4 1f8 1 10466548 )
call 10092894 genunix:uiomove from 51022a54 nfs:writerp+1f8
( 40ffc638 3037d5a0 1 1fc 3037d598 3037d5c0 )
call 5102285c nfs:writerp from 51036b54 nfs:nfs3_write+25c
( 1fc 3037d5a0 0 638 1fc 40ffc638 )
jmpl 10015ed4 gen_clk_int from 100f7cb4 genunix:vn_rdwr+c8
( 1fc 0 40ffc000 fffffffd 506a99a8 512d4ae8 )
call 100f7bec genunix:vn_rdwr from 507ffe20 elfexec:elfnote+cc
( 1 512d4af4 0 507a37b0 0 638 )
call 507ffd54 elfexec:elfnote from 50800ccc elfexec:write_old_elfnotes+1f0
( 512d4af4 3037d814 1 507a37b0 53419af0 0 )
call 50800adc elfexec:write_old_elfnotes from 508001ec elfexec:elfcore+39c
( 0 51611568 513c0580 3037d814 0 7fffffff )
jmpl 10015ed4 gen_clk_int from 10070068 genunix:core+258
( 0 1 52af3040 0 20 52af3000 )
call 1006fe10 genunix:core from 100b3060 genunix:psig+358
( 1040b518 524e3ce8 507a37b0 0 7fffffff b )
call 100b2d08 genunix:psig from 1002f4f4 trap_cleanup+1ec
( 2 b 10460000 b 0 400 )
call 1002f308 trap_cleanup from 1002f290 trap+1668
( 3037dae8 6 3037da60 1 3 51611568 )
call 10023f68 panic from 100070ac sys_tl1_panic+8
( 3037dae8 10000 5 148f0 524e3ce8 10000 )
XXXXXXX from 148bc
( 8 0 76db46a8 2f788 12d008 14770 )
The Solaris per-thread kernel stack is 8K (actually less than 7K). The actual overflow happened as follows. We were almost out of stack while doing a context switch: the top of the stack was 3037dfff and we were already at 3037c0d8. We then got a level 5 interrupt and, inside _sys_trap, bought a new window at TL=1. When it executed
have_win+30 stx %l2, [%l7]
to store %tstate, the store landed in the stack's red zone and we ended up in the tl1 panic code.
Dump of the window where have_win above was running.
{0} ok 104079f8 40 ldump
104079f8: 10044440 10044444 80001e04 10009ed4 ..D@..DD........
10407a08: 506a99a8 512d4c4c 0 3037bfd0 Pj..Q-LL....07..
10407a18: 0 10423004 0 0 .....B0.........
10407a28: ffffffff 10421590 10407a58 10044434 .....B...@zX..D4
As you can see, %l7 is 0x3037bfd0, i.e. in the stack's red zone.
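The arithmetic behind that conclusion can be sketched as follows. This is only an illustration under stated assumptions, not part of the original analysis: it assumes the stack is the contiguous 8K region ending at the reported top (3037dfff), and that the 16 dumped words follow the usual SPARC window layout of %l0-%l7 followed by %i0-%i7. The second assumption is at least consistent with the trace, since the last word (10044434) matches the disp+b0 return address shown by ctrace. The helper names are hypothetical.

```python
# Words from the "104079f8 40 ldump" output, in dump order.
WINDOW_WORDS = [
    0x10044440, 0x10044444, 0x80001e04, 0x10009ed4,
    0x506a99a8, 0x512d4c4c, 0x00000000, 0x3037bfd0,
    0x00000000, 0x10423004, 0x00000000, 0x00000000,
    0xffffffff, 0x10421590, 0x10407a58, 0x10044434,
]

STACK_TOP = 0x3037dfff    # from the report
STACK_SIZE = 8 * 1024     # from the report: 8K per-thread kernel stack
# Assumption: the stack is the contiguous 8K region ending at STACK_TOP,
# so anything below STACK_BASE is outside the stack (the red zone).
STACK_BASE = STACK_TOP - STACK_SIZE + 1


def decode_window(words):
    """Label 16 dumped words as %l0-%l7 then %i0-%i7 (assumed layout)."""
    names = [f"%l{i}" for i in range(8)] + [f"%i{i}" for i in range(8)]
    return dict(zip(names, words))


def bytes_above_base(addr):
    """Distance from the stack base; negative means below the stack."""
    return addr - STACK_BASE


window = decode_window(WINDOW_WORDS)

print(f"stack base          : {STACK_BASE:#x}")
print(f"room left at 3037c0d8: {bytes_above_base(0x3037c0d8)} bytes")
print(f"%l7 = {window['%l7']:#x}, offset from base: "
      f"{bytes_above_base(window['%l7'])} bytes")
```

Under these assumptions the thread had only 0xd8 bytes of stack left when the interrupt arrived, and the address in %l7 sits 0x30 bytes below the stack base.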
Work Around
[ali 4/22/97]:
The immediate problem can be worked around by increasing lwp_default_stksize.
This limits the number of kernel threads the system can support, though.