'fmadm faulty' output currently requires more knowledge of FMA internals
than we should expect from every admin:
# fmadm faulty
STATE    RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
         4521d642-c1f3-4ad7-8923-d317d478056a
-------- ----------------------------------------------------------------------
So when was this diagnosed? What fault does it have? How serious is that fault?
What is the FRU to be replaced? What does "degraded" mean? What does that
long UUID string "4521d642-c1f3-4ad7-8923-d317d478056a" mean?
In the corresponding console output at diagnosis time (or on restart with the
fault still present) we'd have something like:
Oct 23 12:44:10 va64-x2100g-gmp03 REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT_ID> to identify the module.
It is a nuisance that the admin has to cut and paste the event ID and follow
this message in order to answer the above questions.
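To illustrate the manual step involved today, here is a minimal Python sketch
that pulls the UUIDs out of captured 'fmadm faulty' output and builds the
matching 'fmdump -v -u' commands. The function name fmdump_commands and the
SAMPLE text are illustrative assumptions, not part of fmadm; the sketch only
formats command strings and does not run them.

```python
import re

# Matches the 8-4-4-4-12 lowercase hex form of the UUIDs in 'fmadm faulty' output.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def fmdump_commands(faulty_output):
    """Return one 'fmdump -v -u <uuid>' command per distinct UUID, in order."""
    seen = []
    for uuid in UUID_RE.findall(faulty_output):
        if uuid not in seen:  # skip repeats of the same UUID
            seen.append(uuid)
    return ["fmdump -v -u " + uuid for uuid in seen]


# Sample input copied from the 'fmadm faulty' output shown above.
SAMPLE = """\
STATE    RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=0/rank=0
         4521d642-c1f3-4ad7-8923-d317d478056a
-------- ----------------------------------------------------------------------
"""

if __name__ == "__main__":
    for cmd in fmdump_commands(SAMPLE):
        print(cmd)
```

That a helper like this is needed at all is the point: the admin should not
have to bridge two commands by hand to learn what is broken.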
While fmadm(1m) lists the command line options as Evolving and human-readable
output as Unstable, we should probably consider leaving 'fmadm faulty'
output unchanged and introducing a new subcommand instead. Our suggestion
is 'fmadm status', and a mock-up of the output is as follows:
# fmadm status
------------------------------------- ------------ --------- -----------------
EVENT-ID                              MSG-ID       SEVERITY  TIME
------------------------------------- ------------ --------- -----------------
132b2dfa-903d-e81e-bcb4-f98177e762f3  AMD-8000-5M  Major     Jun 28 03:33:09
  Fault     : fault.cpu.amd.l2cachedata
  Certainty : 75%
  Affects   : cpu:///cpuid=0
              Status: Removed from service
  FRU       : hc:///motherboard=0/chip=0
              Label: "CPU0"

  Fault     : fault.cpu.amd.l2cachetag
  Certainty : 25%
  Affects   : cpu:///cpuid=0
              Status: Removed from service
  FRU       : hc:///motherboard=0/chip=0
              Label: "CPU0"
------------------------------------- ------------ --------- -----------------
EVENT-ID                              MSG-ID       SEVERITY  TIME
------------------------------------- ------------ --------- -----------------
27c3a201-f410-610e-c88e-ceac8195ee93  AMD-8000-3K  Major     Oct 13 2004
  Fault     : fault.memory.dimm_ck
  Certainty : 100%
  Affects   : mem:///motherboard=0/chip=2/memory-controller=0/dimm=0
              Status: In service but degraded
  FRU       : hc:///motherboard=0/chip=2/memory-controller=0/dimm=0
              Label: "CPU2 DIMM0"
------------------------------------- ------------ --------- -----------------
EVENT-ID                              MSG-ID       SEVERITY  TIME
------------------------------------- ------------ --------- -----------------
d4671fa8-2a01-68da-fe57-9c09f5a717a2  FMD-8000-2K  Minor     Jul 04 1776
  Fault     : defect.fmd.module
  Certainty : 100%
  Affects   : fmd:///module/eft
              Status: Removed from service
Notes:
o Perhaps combine the two FRU lines depending on whether we know a label or
  not, i.e., this:
    FRU       : hc:///motherboard=0/chip=2/memory-controller=0/dimm=0
                Label: "CPU2 DIMM0"
  becomes
    FRU       : "CPU2 DIMM0"
  if we have a FRU label available; otherwise it would be
    FRU       : hc:///motherboard=0/chip=2/memory-controller=0/dimm=0
o Sort most severe cases to top
o Undecided whether long FMRI strings should wrap to column 0 or
  wrap indented; the former may look ugly but facilitates
  copy-paste.
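
The first two notes can be sketched in Python as follows. This is only an
illustration of the intended behavior, not a proposed implementation: the
record layout, the function names fru_display and sort_by_severity, and the
severity ranking (including a "Critical" level above the "Major" and "Minor"
shown in the mock-up) are all assumptions.

```python
# Assumed ranking, most severe first. Only "Major" and "Minor" appear in the
# mock-up; "Critical" is an assumption added for illustration.
SEVERITY_RANK = {"Critical": 0, "Major": 1, "Minor": 2}


def fru_display(fmri, label=None):
    """Return the quoted FRU label if one is known, otherwise the FMRI."""
    return '"%s"' % label if label else fmri


def sort_by_severity(records):
    """Return records most severe first; unknown severities go last.

    sorted() is stable, so records of equal severity keep their input order.
    """
    worst = len(SEVERITY_RANK)
    return sorted(records, key=lambda r: SEVERITY_RANK.get(r["severity"], worst))


if __name__ == "__main__":
    # Hypothetical records using event IDs from the mock-up.
    records = [
        {"event_id": "d4671fa8-2a01-68da-fe57-9c09f5a717a2", "severity": "Minor"},
        {"event_id": "132b2dfa-903d-e81e-bcb4-f98177e762f3", "severity": "Major"},
    ]
    for rec in sort_by_severity(records):
        print(rec["severity"], rec["event_id"])
    print(fru_display("hc:///motherboard=0/chip=2/memory-controller=0/dimm=0",
                      "CPU2 DIMM0"))
```

Because the sort is stable, events of equal severity could be ordered by time
beforehand and would keep that order within each severity level.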
The above output content and format are not set in stone and we're open to
suggestions and discussion. The main constraint is that it be achievable without
a significant overhaul of the FMA infrastructure that would delay
implementation - we'd like to ship this soon.