OpenSolaris

Bug ID 6739314
Synopsis failed log devices in a pool with spares causes panic on load
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:zfs
Keywords
Responsible Engineer Eric Schrock
Reported Against snv_96
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_98
Fixed In snv_98
Release Fixed solaris_nevada(snv_98)
Related Bugs 6737037
Submit Date 20-August-2008
Last Update Date 10-September-2008
Description
This is not specific to log devices, although a failed log device is
the most likely trigger.  It can also happen with l2cache devices, and
can be triggered by a variety of I/O errors during load.

In spa_load(), there are a series of failures that can happen after
spa_load_spares():

	/*
         * Load any hot spares for this pool.
         */
        error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
            DMU_POOL_SPARES, sizeof (uint64_t), 1, &spa->spa_spares.sav_object);
        if (error != 0 && error != ENOENT) {
                vdev_set_state(rvd, B_TRUE, VDEV_STATE_CANT_OPEN,
                    VDEV_AUX_CORRUPT_DATA);
                error = EIO;
                goto out;
        }

	...

        /*
         * Load any level 2 ARC devices for this pool.
         */
        error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
            DMU_POOL_L2CACHE, sizeof (uint64_t), 1,
            &spa->spa_l2cache.sav_object);
        if (error != 0 && error != ENOENT) {
                vdev_set_state(rvd, B_TRUE, VDEV_STATE_CANT_OPEN,
                    VDEV_AUX_CORRUPT_DATA);
                error = EIO;
--------------> goto out;
        }

	...
        if (spa_check_logs(spa)) {
                vdev_set_state(rvd, B_TRUE, VDEV_STATE_CANT_OPEN,
                    VDEV_AUX_BAD_LOG);
                error = ENXIO;
                ereport = FM_EREPORT_ZFS_LOG_REPLAY;
--------------> goto out;
        }
	...

When this function fails, we'll unload and deactivate the spa_t:

                if (error) {
                        /*
                         * We can't open the pool, but we still have useful
                         * information: the state of each vdev after the
                         * attempted vdev_open().  Return this to the user.
                         */
                        if (config != NULL && spa->spa_root_vdev != NULL) {
                                spa_config_enter(spa, RW_READER, FTAG);
                                *config = spa_config_generate(spa, NULL, -1ULL,
                                    B_TRUE);
                                spa_config_exit(spa, FTAG);
                        }
                        spa_unload(spa);
                        spa_deactivate(spa);
                        spa->spa_last_open_failed = B_TRUE;
			...

The problem comes from the fact that in spa_unload(), we
free the spares, but don't reset the number of spares to zero:

        for (i = 0; i < spa->spa_spares.sav_count; i++)
                vdev_free(spa->spa_spares.sav_vdevs[i]);
        if (spa->spa_spares.sav_vdevs) {
                kmem_free(spa->spa_spares.sav_vdevs,
                    spa->spa_spares.sav_count * sizeof (void *));
                spa->spa_spares.sav_vdevs = NULL;
        }

In this case, 'sav_count' will still be set to the number of loaded
spares, but 'sav_vdevs' will be NULL.  Next time we come through
spa_load(), we'll go through the mosconfig path, before loading any
spares:

        if (!mosconfig) {
		...

                spa_config_set(spa, newconfig);
--------------> spa_unload(spa);
                spa_deactivate(spa);
                spa_activate(spa);

                return (spa_load(spa, newconfig, state, B_TRUE));
        }

In our second trip through spa_unload(), we'll notice the non-zero
spare count and attempt to execute the same bit of code:

        for (i = 0; i < spa->spa_spares.sav_count; i++)
                vdev_free(spa->spa_spares.sav_vdevs[i]);

But at this point, 'sav_vdevs' is still NULL, and we'll panic
dereferencing a NULL pointer.  The solution is to zero out
'sav_count' as part of spa_unload() to bring it back to a
pristine state.
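
The stale-count condition described above can be reproduced in a small
userland model.  This is only a sketch under stated assumptions, not the
delivered fix: toy_vdev_t, toy_aux_t, toy_aux_load(), toy_aux_unload() and
toy_aux_is_consistent() are hypothetical stand-ins for the real ZFS types
and functions, malloc()/free() replace the kernel allocator, and the
'zero_count' flag exists only to compare the old and proposed behaviour
side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for vdev_t. */
typedef struct toy_vdev {
	int tv_id;
} toy_vdev_t;

/* Hypothetical stand-in for the spares bookkeeping kept in spa_t. */
typedef struct toy_aux {
	toy_vdev_t **sav_vdevs;	/* array of loaded spares, or NULL */
	int sav_count;		/* number of entries in sav_vdevs */
} toy_aux_t;

/* Invariant: a NULL array must always be paired with a zero count. */
static int
toy_aux_is_consistent(const toy_aux_t *sav)
{
	return (sav->sav_vdevs != NULL || sav->sav_count == 0);
}

/* Load 'n' spares.  Returns 0 on success, -1 on bad input or no memory. */
static int
toy_aux_load(toy_aux_t *sav, int n)
{
	int i;

	/* Loading expects a pristine structure. */
	assert(sav->sav_vdevs == NULL && sav->sav_count == 0);
	if (n <= 0)
		return (-1);
	sav->sav_vdevs = calloc((size_t)n, sizeof (toy_vdev_t *));
	if (sav->sav_vdevs == NULL)
		return (-1);
	for (i = 0; i < n; i++) {
		sav->sav_vdevs[i] = malloc(sizeof (toy_vdev_t));
		if (sav->sav_vdevs[i] == NULL) {
			/* Undo the partial load and stay pristine. */
			while (i-- > 0)
				free(sav->sav_vdevs[i]);
			free(sav->sav_vdevs);
			sav->sav_vdevs = NULL;
			return (-1);
		}
		sav->sav_vdevs[i]->tv_id = i;
	}
	sav->sav_count = n;
	return (0);
}

/*
 * Mirrors the loop quoted from spa_unload().  With zero_count == 0 the
 * count is left stale, as in the report; with zero_count != 0 the count
 * is reset, as the report proposes.
 */
static void
toy_aux_unload(toy_aux_t *sav, int zero_count)
{
	int i;

	for (i = 0; i < sav->sav_count; i++)
		free(sav->sav_vdevs[i]);
	if (sav->sav_vdevs != NULL) {
		free(sav->sav_vdevs);
		sav->sav_vdevs = NULL;
	}
	if (zero_count)
		sav->sav_count = 0;
}
```

In this model, toy_aux_unload(&sav, 0) leaves 'sav_count' non-zero with
'sav_vdevs' NULL, so toy_aux_is_consistent() returns 0 and a second unload
would index a NULL array.  With toy_aux_unload(&sav, 1) the structure is
back in its initial state and unloading again is a no-op.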
Work Around
N/A
Comments
N/A