OpenSolaris

Bug ID 6761784
Synopsis undiagnosable problems on a restart can cause spurious RESOLVED events
State 10-Fix Delivered (Fix available in build)
Category:Subcategory utility:fm
Keywords
Responsible Engineer Stephen Hanson
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_102
Fixed In snv_102
Release Fixed solaris_nevada(snv_102)
Related Bugs
Submit Date 21-October-2008
Last Update Date 6-November-2008
Description
Under some circumstances, spurious RESOLVED events can appear in the log, e.g.:

Sep 08 01:47:33.6537 da430fb8-14e1-6b87-8138-83ddcc440798 FMD-8000-6U Resolved
Sep 08 01:52:25.2156 da430fb8-14e1-6b87-8138-83ddcc440798 FMD-8000-6U Resolved
Sep 08 01:52:27.1484 da430fb8-14e1-6b87-8138-83ddcc440798 FMD-8000-6U Resolved 

This happens when, on a restart, the checkpoint file of the diagnosis engine (DE) contains unsolved events that the DE then fails to diagnose correctly for some reason, so it calls fmd_case_solve() with an undiagnosable fault or similar. At this point during start-up we are still in the DE's init routine, so the fmd_case_solve() call ends up in the following code in fmd_case_transition():

        /*
         * If the module has initialized, then publish the appropriate event
         * for the new case state.  If not, we are being called from the
         * checkpoint code during module load, in which case the module's
         * _fmd_init() routine hasn't finished yet, and our event dictionaries
         * may not be open yet, which will prevent us from computing the event
         * code.  Defer the call to fmd_case_publish() by enqueuing a PUBLISH
         * event in our queue: this won't be processed until _fmd_init is done.
         */
        if (cip->ci_mod->mod_flags & FMD_MOD_INIT)
                fmd_case_publish(cp, state);
        else {
                fmd_case_hold(cp);
                e = fmd_event_create(FMD_EVT_PUBLISH, FMD_HRT_NOW, NULL, cp);
                fmd_eventq_insert_at_head(cip->ci_mod->mod_queue, e);
        } 

So the event is queued rather than published, leaving the case state set to SOLVED but with nothing in the resource cache. This confuses fmd_case_repair_replay_case(), which finds no faulty suspects for the case in the resource cache and therefore reports the case as RESOLVED.
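
The ordering problem described above can be illustrated with a small standalone model. This is only a sketch and not fmd code: the types and functions (model_case, model_solve, model_drain_publish, model_replay) are hypothetical, and it assumes that processing the PUBLISH event is the step that records the faulty suspects in the resource cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; these are NOT the real fmd structures. */
enum case_state { CASE_UNSOLVED, CASE_SOLVED, CASE_RESOLVED };

struct model_case {
        enum case_state state;
        bool publish_pending;   /* a PUBLISH event is queued, not yet handled */
        int nsuspects;          /* suspects named by the diagnosis */
        int cached_faulty;      /* faulty suspects seen in the resource cache */
};

/*
 * Models solving a case.  If the module has finished initializing, the
 * publish happens immediately (assumed here to populate the cache);
 * otherwise it is deferred, as in the fmd_case_transition() excerpt.
 */
static void
model_solve(struct model_case *c, bool mod_initialized, int nsuspects)
{
        assert(c != NULL);
        c->state = CASE_SOLVED;
        c->nsuspects = nsuspects;
        if (mod_initialized)
                c->cached_faulty = nsuspects;
        else
                c->publish_pending = true;
}

/* Models the module queue processing the deferred PUBLISH event. */
static void
model_drain_publish(struct model_case *c)
{
        assert(c != NULL);
        if (c->publish_pending) {
                c->cached_faulty = c->nsuspects;
                c->publish_pending = false;
        }
}

/*
 * Models the replay check as described in this report: a SOLVED case
 * with no faulty suspects in the cache is reported as RESOLVED.
 */
static enum case_state
model_replay(const struct model_case *c)
{
        assert(c != NULL);
        if (c->state == CASE_SOLVED && c->cached_faulty == 0)
                return (CASE_RESOLVED);
        return (c->state);
}
```

In this model, calling model_solve() with mod_initialized set to false and then model_replay() before model_drain_publish() yields CASE_RESOLVED, which corresponds to the spurious event. Calling model_replay() after the drain, or solving with the module already initialized, yields CASE_SOLVED.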
Work Around
N/A
Comments
N/A