OpenSolaris

Bug ID 6283577
Synopsis cfgadm configure panic - NULL pointer (mmu_fsr) dereferenced in module "ip"
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:sched
Keywords DR | fireengine | onnv_triage | panther | s10u1-req | squeue
Responsible Engineer Alexander Kolbasov
Reported Against
Duplicate Of
Introduced In solaris_10
Commit to Fix snv_23
Fixed In snv_23
Release Fixed solaris_nevada(snv_23) , solaris_10u1(s10u1_15) (Bug ID:2128984)
Related Bugs 6287661
Submit Date 9-June-2005
Last Update Date 12-January-2007
Description
The system, a Panther F6800, panics during a "cfgadm configure" operation. The panic is caused by a NULL pointer dereference (mmu_fsr=0) in module "ip". The system had been running specweb_ssl, tpcso, and Oracle9i for about 15 hours, with DR operations roughly every 17 minutes.

panic string:	BAD TRAP: type=31 rp=2a1006a9520 addr=40 mmu_fsr=0 occurred in module "ip" due to a NULL pointer dereference

The core file is in:  /net/mdb.eng/cores/jm146261/

SCApp Version =  5.19.0 B15

See attachments for the system configuration, the DR script that was being run, and the script output.



 xxxxx@xxxxx.com 2005-06-09 23:16:39 GMT
We had a look at the crash file.
The panic happened because cpu_squeue was NULL at the time tcp_open() called
IP_SQUEUE_GET().
> $C
000002a1006a8dc1 tcp_get_conn+0x10(0, 0, 70396400, 30006b64000, 1, 190c000)
000002a1006a8e71 tcp_open+0x1d4(30039913480, 2a1006a9958, 40003, 394, 300322824e8, 2a)
000002a1006a8f21 qattach+0x128(3021e5be560, 2a1006a9958, 40003, 300322824e8, 0, 
The panic thread 300397486c0 was running on CPU 523. Note that a DR operation
was in progress for the same CPU (i.e. 523), in thread 30214f14ca0.

Before ip_squeue_cpu_setup(), called from cpu_online(), could assign the squeue
to the cpu_squeue field, the panic thread (300397486c0) ran and panicked the system.
Looking at cpu_online(), it is interesting to see that some CPU structures are
initialized only after the slave thread has been allowed to finish starting up:
	cp->cpu_flags &= ~(CPU_QUIESCED | CPU_OFFLINE | CPU_FROZEN |
	    CPU_SPARE);		/* <=== we're flagging slave_startup() here */
	start_cpus();
	cpu_stats_kstat_create(cp);
	cpu_create_intrstat(cp);
	lgrp_kstat_create(cp);
	/* <=== the one responsible for calling ip_squeue_cpu_setup() */
	cpu_state_change_notify(cp->cpu_id, CPU_ON);
	cpu_intr_enable(cp);	/* arch-dep hook */
	cyclic_online(cp);
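
To make the window concrete, here is a minimal single-threaded sketch of the ordering problem described above. It is not kernel code: every sim_* name is a hypothetical stand-in, and the "thread running on the new CPU" is modelled as a simple check made between the two steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Solaris definitions. */
#define	SIM_CPU_OFFLINE	0x1

typedef struct sim_squeue { int sq_id; } sim_squeue_t;

typedef struct sim_cpu {
	int		cpu_id;
	unsigned	cpu_flags;
	sim_squeue_t	*cpu_squeue;	/* NULL until the CPU_ON hook runs */
} sim_cpu_t;

static sim_squeue_t sim_sq = { 1 };

/* Models a thread on this CPU that needs cpu_squeue (as tcp_open does). */
static int
sim_open_would_panic(const sim_cpu_t *cp)
{
	int runnable = (cp->cpu_flags & SIM_CPU_OFFLINE) == 0;

	return (runnable && cp->cpu_squeue == NULL);
}

/* Order in the excerpt: clear the flags first, run the CPU_ON hook later. */
static int
sim_online_flags_first(sim_cpu_t *cp)
{
	int window = 0;

	assert(cp != NULL);
	cp->cpu_flags &= ~SIM_CPU_OFFLINE;	/* CPU may now run threads */
	window |= sim_open_would_panic(cp);	/* a thread gets in here */
	cp->cpu_squeue = &sim_sq;		/* hook assigns the squeue */
	window |= sim_open_would_panic(cp);
	return (window);
}

/* Alternative order: the squeue exists before the CPU can run threads. */
static int
sim_online_squeue_first(sim_cpu_t *cp)
{
	int window = 0;

	assert(cp != NULL);
	cp->cpu_squeue = &sim_sq;
	window |= sim_open_would_panic(cp);
	cp->cpu_flags &= ~SIM_CPU_OFFLINE;
	window |= sim_open_would_panic(cp);
	return (window);
}
```

With a CPU that starts offline and has a NULL cpu_squeue, sim_online_flags_first() reports the window and sim_online_squeue_first() does not.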


sudheer.abdul- xxxxx@xxxxx.com 2005-06-13 12:42:18 GMT

--
I think cpu_squeue needs to be set up when the CPU is added to the active list. When we add a CPU to the active list, we must not grab any locks because the other CPUs are paused. The state change hooks can't be called before the CPUs are started. So it may be more appropriate to set up cpu_squeue in cpu_add_active_internal(), which is where the CPU is added to the active list of CPUs.
 xxxxx@xxxxx.com 2005-06-14 04:50:16 GMT
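
One way to read the no-locks constraint in the comment above is to do anything that may take a lock (such as allocation) before the CPUs are paused, and leave only a plain pointer store for the paused section. The sketch below illustrates that split under stated assumptions: all sim_* names are hypothetical, "paused" is a simple flag, and sim_take_lock() stands in for any lock acquisition. It is not how cpu_add_active_internal() is actually written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real Solaris definitions. */
typedef struct sim_squeue { int sq_id; } sim_squeue_t;

typedef struct sim_cpu {
	int		cpu_id;
	int		cpu_active;
	sim_squeue_t	*cpu_squeue;
} sim_cpu_t;

static int sim_cpus_paused;	/* nonzero while other CPUs are paused */
static int sim_lock_violations;	/* locks taken while paused */
static sim_squeue_t sim_pool[4];

/* Stand-in for any operation that acquires a lock. */
static void
sim_take_lock(void)
{
	if (sim_cpus_paused)
		sim_lock_violations++;
}

/* May take locks, so it must run before the CPUs are paused. */
static sim_squeue_t *
sim_squeue_prealloc(int id)
{
	sim_squeue_t *sq = &sim_pool[id % 4];

	sim_take_lock();
	sq->sq_id = id;
	return (sq);
}

/* Runs with CPUs paused: only plain stores, no locks. */
static void
sim_add_active(sim_cpu_t *cp, sim_squeue_t *sq)
{
	assert(sim_cpus_paused);
	cp->cpu_squeue = sq;
	cp->cpu_active = 1;
}

static void
sim_configure(sim_cpu_t *cp)
{
	sim_squeue_t *sq = sim_squeue_prealloc(cp->cpu_id);

	sim_cpus_paused = 1;
	sim_add_active(cp, sq);
	sim_cpus_paused = 0;
}
```

After sim_configure() the CPU is active with a non-NULL cpu_squeue, and sim_lock_violations stays at zero because the only lock was taken before the pause.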

 xxxxx@xxxxx.com 2005-06-15 23:26:49 GMT

I think the fix can be very simple: we can use the CPU_INIT hook to create the squeue
and the CPU_ON hook to bind it to the CPU. The problem is that CPU_ON is not called
for CPU0. It would be interesting to know whether this is OK for other consumers of the
CPU_ON hook.

 xxxxx@xxxxx.com 2005-07-09 00:09:35 GMT
 xxxxx@xxxxx.com 2005-07-09 01:21:33 GMT
I think that CPU_CONFIG is a good place to create squeues. We can create them on the
CPU_CONFIG event and bind them to the CPU later, on CPU_ON. This requires a small
additional change: ip_squeue_set_create() should not attempt to bind to offlined CPUs.

This makes the fix pretty simple.
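
The create-then-bind split proposed above can be sketched roughly as follows. This is an illustration only, not the delivered fix: the sim_* types, the event enum and the handler are hypothetical, and the fallback that creates the squeue on SIM_CPU_ON when no SIM_CPU_CONFIG was seen is an assumption added for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define	SIM_UNBOUND	(-1)

/* Hypothetical stand-ins for the CPU_CONFIG and CPU_ON events. */
typedef enum { SIM_CPU_CONFIG, SIM_CPU_ON } sim_cpu_event_t;

typedef struct sim_squeue {
	int	sq_bound_cpu;	/* SIM_UNBOUND until bound */
} sim_squeue_t;

typedef struct sim_cpu {
	int		cpu_id;
	int		cpu_online;
	sim_squeue_t	*cpu_squeue;
	sim_squeue_t	cpu_sq_storage;	/* avoids allocation in the sketch */
} sim_cpu_t;

/* Create the squeue; bind right away only if the CPU is already online. */
static void
sim_squeue_create(sim_cpu_t *cp)
{
	if (cp->cpu_squeue != NULL)
		return;
	cp->cpu_sq_storage.sq_bound_cpu = SIM_UNBOUND;
	if (cp->cpu_online)
		cp->cpu_sq_storage.sq_bound_cpu = cp->cpu_id;
	cp->cpu_squeue = &cp->cpu_sq_storage;
}

static void
sim_cpu_event(sim_cpu_t *cp, sim_cpu_event_t ev)
{
	assert(cp != NULL);
	switch (ev) {
	case SIM_CPU_CONFIG:
		/* cpu_squeue becomes non-NULL before the CPU runs threads. */
		sim_squeue_create(cp);
		break;
	case SIM_CPU_ON:
		/* Assumed fallback for a CPU that never saw SIM_CPU_CONFIG. */
		sim_squeue_create(cp);
		if (cp->cpu_online)
			cp->cpu_squeue->sq_bound_cpu = cp->cpu_id;
		break;
	}
}
```

In this model, a thread that lands on the CPU between the two events finds a usable but unbound squeue, not a NULL pointer.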
Work Around
N/A
Comments
N/A