Description
Under some circumstances, we can see spurious RESOLVED events in the log, e.g.:
Sep 08 01:47:33.6537 da430fb8-14e1-6b87-8138-83ddcc440798 FMD-8000-6U Resolved
Sep 08 01:52:25.2156 da430fb8-14e1-6b87-8138-83ddcc440798 FMD-8000-6U Resolved
Sep 08 01:52:27.1484 da430fb8-14e1-6b87-8138-83ddcc440798 FMD-8000-6U Resolved
It turns out that this happens when, on a restart, the DE's checkpoint file contains unsolved events that the DE then fails to diagnose correctly (for whatever reason), so it calls fmd_case_solve() with an undiagnosable fault or similar. At this point in start-up we are still in the DE's init routine, so the fmd_case_solve() call ends up in the following code in fmd_case_transition():
	/*
	 * If the module has initialized, then publish the appropriate event
	 * for the new case state. If not, we are being called from the
	 * checkpoint code during module load, in which case the module's
	 * _fmd_init() routine hasn't finished yet, and our event dictionaries
	 * may not be open yet, which will prevent us from computing the event
	 * code. Defer the call to fmd_case_publish() by enqueuing a PUBLISH
	 * event in our queue: this won't be processed until _fmd_init is done.
	 */
	if (cip->ci_mod->mod_flags & FMD_MOD_INIT)
		fmd_case_publish(cp, state);
	else {
		fmd_case_hold(cp);
		e = fmd_event_create(FMD_EVT_PUBLISH, FMD_HRT_NOW, NULL, cp);
		fmd_eventq_insert_at_head(cip->ci_mod->mod_queue, e);
	}
So the event is queued rather than published, leaving the case state set to SOLVED but with nothing in the resource cache. This then confuses fmd_case_repair_replay_case(), which finds no faulty suspects for the case in the resource cache and therefore reports the case as RESOLVED.