OpenSolaris

Bug ID 6773561
Synopsis VM gives pages which have p_fsdata populated erroneously
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:vm
Keywords
Responsible Engineer Prakash Sangappa
Reported Against 3.2u1_16 , 3.2u2_06 , 3.2u2_08 , s10u5_fcs
Duplicate Of
Introduced In solaris_2.3
Commit to Fix snv_106
Fixed In snv_106
Release Fixed solaris_nevada(snv_106) , solaris_10u7(s10u7_06) (Bug ID:2170215)
Related Bugs 6670562
Submit Date 19-November-2008
Last Update Date 16-January-2009
Description
The p_fsdata field of page_t is reserved for filesystems to use as they see fit, and
its interpretation varies from one filesystem to another. PxFS uses this field to
indicate whether a page maps a hole in the file. For PxFS this field is meaningful
only if the underlying filesystem is UFS.

During a multi-node write test for PxFS over VxFS on x64 machines running S10,
we encountered data corruption. This is reproducible at will, but the steps are
time consuming. On analysing cores from a cluster that had hit this bug, I found that
the page which mapped the corrupted data must have had the p_fsdata bit set to 1. This
can be verified from the PxFS debug buffer. PxFS never sets p_fsdata if the underlying
filesystem is VxFS, so it must have been set before page_lookup() gave us the page.

To verify whether this can happen, I looked at all pages in the core and found that
quite a few pages had the bit set. Pages ranging from kvm pages to program text had
p_fsdata set to 1. Other cluster nodes did not exhibit this symptom. Looking at the vm
code showed that only page_relocate_hash() manipulates this field. No DR was done on
the system and, as far as I understand, page_relocate_hash() should not have been called.

The cores are available at:
/net/zuma-1/logs/cores/other-cores/6750350/redo

The core which shows the problem is in predo3:
/net/zuma-1/logs/cores/other-cores/6750350/redo/predo3

The PxFS debug logs which show the problem are
/net/zuma-1/logs/cores/other-cores/6750350/redo/pxfs.dbg.3
and 
/net/zuma-1/logs/cores/other-cores/6750350/redo/pxfs.dbg.4

In pxfs.dbg.4, at lines 8187 and 8190, we can see the debug prints from the
space allocation request on the server that caused the corruption.

th ffffffffa487c260 tm  62208622: vx_alloc_data vp fffffe82a7e1bd00 off 1000 len 1000

th ffffffffa487c260 tm  62208625: bmap(ffffffff89e3ee10) off 1000 length 1000 len 1000 siz 14000

The corresponding debug print on the client shows that the p_fsdata bit was set. In
pxfs.dbg.3, line 110365:

th ffffffff9d607400 tm 303349490: getapage(ffffffffadfd4800) bmap existing page off 1000

There are unique debug prints for all the cases when PxFS sets this field. The
debug buffers do not show any such entry.

Looking into the core for pages with the p_fsdata set, here are three of them.

> 0xffffffffb9cd8100::print vnode_t
{
...
    v_vfsp = root
    v_stream = 0
    v_type = 1 (VREG)
...
    v_path = 0xffffffffb972a880 "/opt/SUNWscts/tset/pxfs/tests/tc_all_reboot/tc_all_reboot"
...
}

> 0xffffffffb9cd8100::walk page | ::print page_t p_fsdata
...
p_fsdata = 0x1
...
p_fsdata = 0x1
...
p_fsdata = 0x1
...

> kvp::walk page | ::print page_t p_fsdata ! grep 0x1
p_fsdata = 0x1
p_fsdata = 0x1
p_fsdata = 0x1
p_fsdata = 0x1
p_fsdata = 0x1
p_fsdata = 0x1
p_fsdata = 0x1
...

> 0xffffffffb929db40::walk page | ::print page_t p_fsdata ! grep 0x1 | wc -l
     206
Work Around
N/A
Comments
Random pages have the p_fsdata bit set to a non-zero value. This can cause filesystems
that make use of the bit to initiate a wrong action on the page.
Apart from PxFS (which I wasn't aware was using this until today), the only existing consumer of the page 'p_fsdata' member is the NFS _client_ code, see

usr/src/uts/common/nfs/rnode.h

/*
 * The various values for the commit states.  These are stored in
 * the p_fsdata byte in the page struct.
 */
#define	C_NOCOMMIT	0	/* no commit is required */
#define	C_COMMIT	1	/* a commit is required so do it now */
#define	C_DELAYCOMMIT	2	/* a commit is required, but can be delayed */

I wonder if your magic value of '1' is actually C_COMMIT and the page belongs
to the NFS client code, i.e. is backed by an rnode. What are the vnode ops
that belong to this vnode?
In the core I have, all pages with p_fsdata set to 1 were backed by ufs vnodes.
The curious part is that there was no value other than 0 or 1. For NFS, 1 is C_COMMIT
and 2 is C_DELAYCOMMIT. For PxFS, 1 is PXFS_HOLE, and it does not set any other value.
Irrespective of the value, p_fsdata should not be non-zero for UFS-backed pages.
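
To illustrate why a bare value of 1 is ambiguous, here is a minimal user-space sketch of how the same p_fsdata byte reads differently depending on which filesystem owns the page. The C_* values are the ones quoted from rnode.h and PXFS_HOLE = 1 is taken from this discussion. The fs_owner_t and fsdata_meaning_t types and the fsdata_meaning() function are hypothetical names invented for this illustration, not kernel interfaces.

```c
#include <assert.h>

/* NFS client commit states, as quoted from rnode.h. */
#define	C_NOCOMMIT	0
#define	C_COMMIT	1
#define	C_DELAYCOMMIT	2

/* PxFS hole marker; the value 1 is taken from this discussion. */
#define	PXFS_HOLE	1

/* Hypothetical tags for the filesystem that owns a page. */
typedef enum { FS_UFS, FS_NFS, FS_PXFS } fs_owner_t;

/* Hypothetical result codes: what the byte means to the owning filesystem. */
typedef enum {
	FSDATA_CLEAR,
	FSDATA_NFS_COMMIT,
	FSDATA_NFS_DELAYCOMMIT,
	FSDATA_PXFS_HOLE,
	FSDATA_UNEXPECTED
} fsdata_meaning_t;

fsdata_meaning_t
fsdata_meaning(fs_owner_t owner, unsigned char p_fsdata)
{
	assert(owner == FS_UFS || owner == FS_NFS || owner == FS_PXFS);

	/* Zero is C_NOCOMMIT for NFS and "not a hole" for PxFS. */
	if (p_fsdata == 0)
		return (FSDATA_CLEAR);

	switch (owner) {
	case FS_NFS:
		if (p_fsdata == C_COMMIT)
			return (FSDATA_NFS_COMMIT);
		if (p_fsdata == C_DELAYCOMMIT)
			return (FSDATA_NFS_DELAYCOMMIT);
		break;
	case FS_PXFS:
		if (p_fsdata == PXFS_HOLE)
			return (FSDATA_PXFS_HOLE);
		break;
	case FS_UFS:
		/* UFS does not use the field, so any non-zero value is stale. */
		break;
	}
	return (FSDATA_UNEXPECTED);
}
```

In this model, fsdata_meaning(FS_UFS, 1) comes back as FSDATA_UNEXPECTED, which is the situation seen in the core: the value itself cannot tell us who wrote it.
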
The p_fsdata in the page_t is for filesystem use. In the ON consolidation, only the NFS
filesystem uses it. In the dump I see the vxfs filesystem in use. Does vxfs also use the
p_fsdata member, besides pxfs?

It is possible that some filesystem set p_fsdata and did not clear it
before freeing/destroying the page. When the page is destroyed (page_do_hashout()),
the p_fsdata member is not cleared. Later, when this free page gets allocated for reuse,
p_fsdata will hold a stale value, and that is how you could be seeing a ufs file page
having p_fsdata = 0x1. I am searching the dump for a free page having p_fsdata = 0x1,
which would confirm this possibility. If so, we could try to capture the
stack of the thread freeing the page using dtrace.

The other possibility is that a page with p_fsdata set got relocated and the value got
copied over, and the old page now gets reused. This needs further investigation.
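
The stale-value scenario can be sketched in user space as follows. This is only a model under stated assumptions: model_page_t is a cut-down stand-in for page_t, the single linked freelist and the model_page_destroy()/model_page_create() helpers are hypothetical, and the real VM code is far more involved. The one behaviour it is meant to mirror is that destroying a page drops its identity but leaves p_fsdata alone.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel page_t (model only). */
typedef struct model_page {
	void			*p_vnode;	/* owner, NULL once destroyed */
	unsigned char		p_fsdata;	/* filesystem-private byte */
	struct model_page	*p_next;	/* freelist linkage (model only) */
} model_page_t;

static model_page_t *model_freelist;	/* hypothetical single freelist */

/* Drop the page identity and put it on the freelist. */
void
model_page_destroy(model_page_t *pp)
{
	assert(pp != NULL);
	pp->p_vnode = NULL;
	/* p_fsdata is deliberately not cleared, as described above. */
	pp->p_next = model_freelist;
	model_freelist = pp;
}

/* Hand the first free page to a new owner; returns NULL if none is free. */
model_page_t *
model_page_create(void *vp)
{
	model_page_t *pp = model_freelist;

	if (pp == NULL)
		return (NULL);
	model_freelist = pp->p_next;
	pp->p_next = NULL;
	pp->p_vnode = vp;	/* new identity, old p_fsdata still in place */
	return (pp);
}
```

A page that was destroyed with p_fsdata = 1 and is then handed out by model_page_create() to a different vnode still carries the 1, which is the pattern a ufs page with p_fsdata = 0x1 would show.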

Is there a reproducible setup available?
I could find plenty of pages on both the page freelist and the cachelist which have p_fsdata = 0x1.

Ex. - free page from the freelist.

> ffffffff8568e0ca-4a::print page_t
{
    p_offset = 0xffffffffffffffff
    p_vnode = 0 <-- no vnode
  ..
    p_fsdata = 0x1
    p_state = 0x80 <--P_FREE
 ..
}


The pages on the page cachelist are pages that are still associated with
a vnode but are free to be used by others. Note that cachelist pages get allocated
when there are no pages available on the freelist. So far I have only found
ufs pages on the cachelist that have p_fsdata set.

Ex. - free page from the cachelist.

> fffffffff988d642-4a::print page_t
{
    p_offset = 0x41000
    p_vnode = 0xffffffffaa431980 <--ufs vnode
    ..
    p_fsdata = 0x1
    p_state = 0x80 <-- P_FREE
    ..
}

> 0xffffffffaa431980::print vnode_t v_op
v_op = 0xffffffff8853ba80
> 0xffffffff8853ba80/16P
0xffffffff8853ba80:             0xfffffffffbbf61d6ufs_open        
                ufs_close       ufs_read        ufs_write       ufs_ioctl
                fs_setfl        ufs_getattr     ufs_setattr     ufs_access
                ufs_lookup      ufs_create      ufs_remove      ufs_link
                ufs_rename      ufs_mkdir       


If the ufs pages have p_fsdata = 0x1, then it will not get cleared
when these pages get put on the cachelist (freed) or freelist (destroyed).
Subsequently these free pages can be allocated for use by other
consumers such as kvp, the vxfs file system, etc.

I am not familiar with how PXFS works. Could it be possible that PXFS
propagates the p_fsdata value to ufs pages? This needs to be investigated.

It is also possible that these ufs pages already had p_fsdata set
when they were allocated. In that case someone else freed the pages without
clearing p_fsdata.

Using dtrace we should be able to get the stack trace of the thread which is
calling page_free()/page_destroy() on a page with p_fsdata still set.
PxFS sets the PXFS_HOLE bit in the p_fsdata field only on the client
side. There is no chance of a PxFS client accessing UFS pages on the
server side. If PxFS or NFS freed pages with the p_fsdata bit set,
shouldn't the freeing procedure clear the field? If not at the time of
release, shouldn't it be cleared when a page is picked up from the
free-list? I ask because, when you ask VM for a page, you expect
a page with default values or values appropriate for the mapping.

I can't guarantee that there is no bug in PxFS which causes it to
release pages without clearing the p_fsdata bit. In this particular
core, the debug buffers indicate that PxFS has not executed the code
that sets the p_fsdata field.

This is the first place it is set:
http://src.opensolaris.org/source/xref/ohac/ohac/usr/src/common/cl/pxfs/client/aio_callback_impl.cc#616
which is preceded by the debug print at lines 604-608.

This is the second place:
http://src.opensolaris.org/source/xref/ohac/ohac/usr/src/common/cl/pxfs/client/pxreg.cc#1337
which is preceded by the debug print at lines 1323-1329.

Neither of these debug prints show up in the debug buffer. That is
expected with PxFS/VxFS. The client will never see a hole even if the
underlying file system has a hole.
While investigating CR # 6652719 I suspected incorrect handling of
p_fsdata. That does not appear to be the cause of the corruption in that
case, but it may be relevant here. In particular, page_relocate_hash() copies
p_fsdata between target and replacement pages outside the protection of the
vph mutex. Check out the last entry in evaluation #17 for
http://monaco.sfbay/detail.jsf?cr=6652719#Evaluation
Yes, I think page_do_hashout() could clear p_fsdata, as it should be
irrelevant to the new consumer of the page. One outcome of this bug
should be that page_do_hashout() either clears the p_fsdata member or
asserts that p_fsdata is cleared before the page gets freed.

However, we need to root cause this and find out who is freeing these pages
with p_fsdata set, and determine whether that may still pose a problem for that
filesystem. In fact, adding an assert to page_free()/page_destroy() to ensure that
p_fsdata is 0 would help.
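
The two options discussed here (clear the field at hashout time, or assert that it is already clear when the page is freed) could look roughly like the sketch below. This is a user-space model, not the actual change: sketch_page_t is a stand-in for page_t, all sketch_* function names are hypothetical, and the standard assert() stands in for the kernel ASSERT().

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel page_t (model only). */
typedef struct sketch_page {
	void		*p_vnode;	/* owner, NULL once hashed out */
	unsigned char	p_fsdata;	/* filesystem-private byte */
} sketch_page_t;

/* Option 1: clear p_fsdata whenever the page loses its identity. */
void
sketch_hashout_clearing(sketch_page_t *pp)
{
	pp->p_vnode = NULL;
	pp->p_fsdata = 0;	/* the next consumer starts from a clean value */
}

/* Predicate that an assert in the free path could check. */
int
sketch_fsdata_is_clear(const sketch_page_t *pp)
{
	return (pp->p_fsdata == 0);
}

/* Option 2: leave clearing to the filesystem, but catch offenders. */
void
sketch_page_free_checked(sketch_page_t *pp)
{
	assert(sketch_fsdata_is_clear(pp));	/* trips on a stale value */
	pp->p_vnode = NULL;
}
```

Option 1 silently repairs the page for the next consumer, while option 2 points at the filesystem that freed the page without cleaning up, which is more useful while the root cause is still unknown.
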
This is how pages with the p_fsdata bit set are left in memory.

PxFS supports read aheads. The pages created for mapping the address range
covered by the read ahead must be left in memory for later consumption. In
the case of UFS these read aheads could be issued on an address range that
covers a hole. PxFS sets the p_fsdata field of such a page with the value
PXFS_HOLE. It then issues a pvn_read_done() on those pages to unlock the
pages.

These pages can and will be picked up by a read or write thread of PxFS.
PxFS needs to know if the page covers a hole and checks p_fsdata for
the presence of PXFS_HOLE. This requirement makes it impossible for PxFS to
clear the p_fsdata field before freeing pages.

VM can reclaim pages from the free list, which is also a cache list, and
re-use them. If there is memory pressure, pages from the free list can be
re-used for mappings other than the one that exists on the page. When the re-use
happens on a PxFS page populated by a read ahead of a hole, the page will
have its p_fsdata field set to non-zero under a non-PxFS filesystem.

With NFS this is not an issue since p_fsdata is used only to track pages that
need COMMIT-ing. Pages populated by a read ahead will never need a commit.
Further, NFS does not have the problem of supporting holes while reading.