|
Description
|
The system, a Panther F6800, panics during a "cfgadm configure" operation. The panic is a NULL pointer dereference (addr=40, mmu_fsr=0) in module "ip". The system had been running specweb_ssl, tpcso, and Oracle9i for about 15 hours, with DR operations roughly every 17 minutes.
panic string: BAD TRAP: type=31 rp=2a1006a9520 addr=40 mmu_fsr=0 occurred in module "ip" due to a NULL pointer dereference
The core file is in: /net/mdb.eng/cores/jm146261/
SCApp Version = 5.19.0 B15
See attachments for the system configuration, the DR script that was being run, and the script output.
xxxxx@xxxxx.com 2005-06-09 23:16:39 GMT
We had a look at the crash file.
The panic happened because cpu_squeue was NULL at the time tcp_open() called
IP_SQUEUE_GET().
> $C
000002a1006a8dc1 tcp_get_conn+0x10(0, 0, 70396400, 30006b64000, 1, 190c000)
000002a1006a8e71 tcp_open+0x1d4(30039913480, 2a1006a9958, 40003, 394, 300322824e8, 2a)
000002a1006a8f21 qattach+0x128(3021e5be560, 2a1006a9958, 40003, 300322824e8, 0,
The panic thread 300397486c0 was running on CPU 523. Note that a DR operation
was in progress for the same CPU (523), in thread 30214f14ca0.
Before ip_squeue_cpu_setup(), called from cpu_online(), could assign the squeue
to the cpu_squeue field, the panic thread (300397486c0) ran and panicked the system.
If we look at cpu_online(), it is interesting to see that some CPU structures
are initialized only after the slave thread has been allowed to finish starting up:
cp->cpu_flags &= ~(CPU_QUIESCED | CPU_OFFLINE | CPU_FROZEN | CPU_SPARE); <==== we're flagging slave_startup() here
start_cpus();
cpu_stats_kstat_create(cp);
cpu_create_intrstat(cp);
lgrp_kstat_create(cp);
cpu_state_change_notify(cp->cpu_id, CPU_ON); <=== the one responsible for calling ip_squeue_cpu_setup()
cpu_intr_enable(cp); /* arch-dep hook */
cyclic_online(cp);
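The ordering above can be modelled in a few lines of user-level C. This is only a sketch under simplified assumptions: sim_cpu_t, sim_squeue_t and the sim_* functions are hypothetical stand-ins, not the kernel's cpu_t, squeue_t or cpu_online() code. It shows the window between the CPU becoming runnable and the CPU_ON hook assigning cpu_squeue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
typedef struct sim_squeue {
    int sq_bound_cpu;           /* id of the CPU this squeue is bound to */
} sim_squeue_t;

typedef struct sim_cpu {
    int cpu_id;
    int cpu_runnable;           /* models the offline flags being cleared */
    sim_squeue_t *cpu_squeue;   /* NULL until the CPU_ON hook has run */
} sim_cpu_t;

/* Step 1 in the model: clear the offline flags, so threads may run here. */
void sim_clear_offline_flags(sim_cpu_t *cp)
{
    assert(cp != NULL);
    cp->cpu_runnable = 1;
}

/* Step 2 in the model: the CPU_ON hook assigns the squeue, much later. */
void sim_cpu_on_hook(sim_cpu_t *cp, sim_squeue_t *sq)
{
    assert(cp != NULL && sq != NULL);
    sq->sq_bound_cpu = cp->cpu_id;
    cp->cpu_squeue = sq;
}

/* Returns 1 if a thread scheduled on cp right now would see a NULL squeue. */
int sim_in_race_window(const sim_cpu_t *cp)
{
    return cp->cpu_runnable && cp->cpu_squeue == NULL;
}
```

In this model, sim_in_race_window() is true between the two steps, which corresponds to the interval in which the panic thread ran on CPU 523.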
sudheer.abdul- xxxxx@xxxxx.com 2005-06-13 12:42:18 GMT
--
I think cpu_squeue needs to be set up when the CPU is added to the active list. When we add a CPU to the active list, we must not grab any locks, because the other CPUs are paused. The state change hooks can't be called before starting the CPUs. So maybe it would be more appropriate to set up cpu_squeue when the CPU is added to the active list of CPUs, which is done by cpu_add_active_internal().
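A rough user-level sketch of that idea follows. It assumes the squeue has been prepared in advance, since no locks (and so presumably no allocation) can be taken while the other CPUs are paused. The names act_cpu_t, act_squeue_t, act_add_active() and the list head are hypothetical and are not the real cpu_add_active_internal() code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
typedef struct act_squeue {
    int sq_id;
} act_squeue_t;

typedef struct act_cpu {
    int cpu_id;
    act_squeue_t *cpu_squeue;
    struct act_cpu *cpu_next_active;    /* link in the model's active list */
} act_cpu_t;

static act_cpu_t *act_list_head = NULL; /* model of the active-CPU list */

/*
 * Models the step performed while the other CPUs are paused: only plain
 * pointer stores, no locks and no allocation. The squeue is stored
 * before the CPU becomes reachable through the active list.
 */
void act_add_active(act_cpu_t *cp, act_squeue_t *prepared_sq)
{
    assert(cp != NULL && prepared_sq != NULL);
    cp->cpu_squeue = prepared_sq;
    cp->cpu_next_active = act_list_head;
    act_list_head = cp;
}

/* Returns 1 if every CPU on the model's active list has a squeue. */
int act_all_active_have_squeue(void)
{
    const act_cpu_t *cp;

    for (cp = act_list_head; cp != NULL; cp = cp->cpu_next_active) {
        if (cp->cpu_squeue == NULL)
            return 0;
    }
    return 1;
}
```

With this ordering, the invariant checked by act_all_active_have_squeue() holds at every point where a CPU is visible on the list.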
xxxxx@xxxxx.com 2005-06-14 04:50:16 GMT
xxxxx@xxxxx.com 2005-06-15 23:26:49 GMT
I think the fix can be very simple: we can use the CPU_INIT hook to create the squeue
and the CPU_ON hook to bind it to the CPU. The problem is that CPU_ON is not called for CPU0. It would be interesting to know whether this is OK for other consumers of the
CPU_ON hook.
xxxxx@xxxxx.com 2005-07-09 00:09:35 GMT
xxxxx@xxxxx.com 2005-07-09 01:21:33 GMT
I think that CPU_CONFIG is a good place to create squeues. We can create them on the
CPU_CONFIG event and bind them to the CPU later, on CPU_ON. This requires a small additional
change: ip_squeue_set_create() should not attempt to bind to offlined CPUs.
This makes the fix pretty simple.
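The create-then-bind split might look roughly like the following user-level sketch. Everything here is a hypothetical model: ev_cpu_t, ev_squeue_t, the EV_* constants and ev_cpu_setup() stand in for the real cpu_t, squeue code, CPU_CONFIG/CPU_ON events and ip_squeue_cpu_setup(), whose actual signatures are not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the CPU_CONFIG and CPU_ON events. */
typedef enum { EV_CPU_CONFIG, EV_CPU_ON } ev_what_t;

typedef struct ev_squeue {
    int sq_bound_cpu;           /* -1 while not bound to any CPU */
} ev_squeue_t;

typedef struct ev_cpu {
    int cpu_id;
    int cpu_is_online;
    ev_squeue_t *cpu_squeue;
} ev_cpu_t;

/* Create a squeue for cp; bind it only if the CPU is already online. */
static ev_squeue_t *ev_squeue_create(const ev_cpu_t *cp)
{
    ev_squeue_t *sq = malloc(sizeof (*sq));

    if (sq == NULL)
        return NULL;
    sq->sq_bound_cpu = cp->cpu_is_online ? cp->cpu_id : -1;
    return sq;
}

/* Model of the state-change callback. Returns 0 on success, -1 on failure. */
int ev_cpu_setup(ev_cpu_t *cp, ev_what_t what)
{
    assert(cp != NULL);
    switch (what) {
    case EV_CPU_CONFIG:
        /* Create early, so cpu_squeue is never NULL once the CPU runs. */
        if (cp->cpu_squeue == NULL)
            cp->cpu_squeue = ev_squeue_create(cp);
        return cp->cpu_squeue != NULL ? 0 : -1;
    case EV_CPU_ON:
        /* Bind later; the squeue itself already exists. */
        if (cp->cpu_squeue == NULL)
            return -1;
        cp->cpu_squeue->sq_bound_cpu = cp->cpu_id;
        return 0;
    }
    return -1;
}
```

In this model the squeue exists from the configure event onwards, so a thread that runs on the CPU before the CPU_ON callback finds a valid, merely unbound, squeue.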
|