OpenSolaris

Bug ID 6664521
Synopsis performance hit when size prioritized over lgroup placement
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:vm
Keywords batoka-perf | opl-perf
Responsible Engineer Michael Corcoran
Reported Against
Duplicate Of
Introduced In solaris_10u2
Commit to Fix snv_86
Fixed In snv_86
Release Fixed solaris_nevada(snv_86) , solaris_10u6(s10u6_01) (Bug ID:2160701)
Related Bugs 6675015
Submit Date 18-February-2008
Last Update Date 10-June-2008
Description
[This change request best read in a mono-spaced font]

Brief summary

   The default behavior of the kernel appears to be to prioritize page
   size over the home lgroup for memory allocation.  This is not always
   the best choice.  This behavior should be customer-settable.

Slightly longer summary: three intertwined issues regarding fallback

   When the NUMA VM code is unable to allocate a page of the size and
   location requested, the operating system needs to become more
   intelligent about fallbacks.  As things stand today:

    - 1. The kernel is too likely to make the wrong fallback decision.

    - 2. If subjected to a long lasting workload, the kernel does not
      appear to learn to improve the fallback.

    - 3. Users do not have a way to teach it to make better fallback
      decisions.

Politeness interrupt

   It is understood that this CR touches on long standing issues, and
   that work has been done over the years on these issues.  That work is
   appreciated.  Thank you for progress thus far, and apologies if the
   above seems too sharply stated.  But more work is needed.

Only the present

   Rather than getting lost in a twisty little maze of change request
   history, this CR intentionally confines itself to the present moment:
   Solaris 10 Update 4 plus patches.  Details of patch levels are in the
   comments, along with a description of behavior on an unannounced
   processor.

Bug or RFE?  

   This report is classified as a bug on the grounds that it is the _job_
   of the NUMA VM code to make intelligent decisions about page placement,
   and the submitter respectfully suggests that this job is not being
   fulfilled as intelligently as it should be.  
   
   Thank you for, at least, considering this point of view.

Details follow on the three numbered points from above.

=================================
1. The kernel is too likely to make the wrong fallback decision.

   What should be done when the requested page size is not available in
   the requested locality group?  Roughly, it would seem that there are
   three basic fallback alternatives:  

         - (a) Settle for a different size.

         - (b) Settle for a different location.

         - (c) Attempt to manufacture the requested size in the requested
           location.

   From external observation, it appears that the OS often (typically?
   always?) picks (b).  If this decision was based on some analysis of
   the cost of TLB handling vs. the cost of remote accesses, perhaps
   that analysis should be done afresh, in light of timings on
   contemporary processors, which have worked to reduce the cost of TLB
   misses.

   Presumably the cost is highly dependent on what the program does with
   the page.  If you're going to allocate it, peek at a few bytes, and
   throw it away, the cost calculation is very different than if you
   allocate it and then use all of it.

   To give an example with a simple model that has round numbers, suppose
   that taking a TLB miss costs on the order of 1000 ns, local pages are
   accessible in 100ns, and remote pages are accessible in 200ns.  

        [Aside: Yes, yes, these round numbers aren't right.  Fix them with
        better numbers, if you wish!  When you fix them, though, please
        take into account that pure lmbench numbers may understate the
        costs of remote accesses, because lmbench may effectively assume
        that the remote processor and memory have nothing better to do
        than to satisfy your request.]

   Sparse access (worst case)

      A program requests a 4MB page and touches 64 bytes on it, scattered
      evenly across the page (that is, 64KB apart).  In our simple model,
      with fallback (a) supplying 64KB pages, the costs would be one of
      these:

          7,400 ns   Local bigpage  64 local accesses, 1 TLB miss
         70,400 ns   Fallback (a)   64 local accesses, 64 TLB misses
         13,800 ns   Fallback (b)   64 remote accesses, 1 TLB miss

      For this example, fallback (b) is indeed the better choice.

   Dense access (NOT worst case)

      Suppose a program requests a 4MB page and updates all of it, reading
      64-byte cache lines and writing them.  This would be 131072 accesses
      to the page.  In our simple model, the costs would be:

          13,108,200 ns   Local bigpage  131072 local accesses, 1 TLB miss
          13,171,200 ns   Fallback (a)   131072 local accesses, 64 TLB misses
          26,215,400 ns   Fallback (b)   131072 remote accesses, 1 TLB miss

      For this example, we're better off with fallback (a).

   Dense access (worst case)

      Please note that the previous example is not "worst case" for dense
      access.  In the worst case, we would indefinitely iterate over the
      data -- and certainly there are real-life algorithms that do
      repeated accesses to relatively small address spaces.  

      As a program iterates indefinitely over a small dataset, the
      advantage of fallback (a) over fallback (b) grows without bound.
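
   The arithmetic in the three examples above can be reproduced with a
   short sketch.  The constants are the round numbers from this report,
   not measurements.  The function names are illustrative only, and the
   sketch keeps the model's assumption that each page mapping costs
   exactly one TLB miss.

```python
# Simple cost model from the text; all constants are the report's round numbers.
TLB_MISS_NS = 1000   # cost of one TLB miss
LOCAL_NS = 100       # latency of one local access
REMOTE_NS = 200      # latency of one remote access
SMALL_PAGES = 64     # a 4MB request backed by 64 smaller pages under fallback (a)


def cost_ns(accesses, remote, tlb_misses):
    """Total time: access latency plus TLB miss penalties."""
    per_access = REMOTE_NS if remote else LOCAL_NS
    return accesses * per_access + tlb_misses * TLB_MISS_NS


def scenario(accesses):
    """Costs of the three placements when every small page is touched."""
    return {
        "local bigpage": cost_ns(accesses, remote=False, tlb_misses=1),
        "fallback (a)": cost_ns(accesses, remote=False, tlb_misses=SMALL_PAGES),
        "fallback (b)": cost_ns(accesses, remote=True, tlb_misses=1),
    }


def break_even_accesses():
    """Number of accesses at which fallbacks (a) and (b) cost the same."""
    extra_miss_cost = (SMALL_PAGES - 1) * TLB_MISS_NS
    return extra_miss_cost // (REMOTE_NS - LOCAL_NS)


sparse = scenario(64)        # sparse example: 64 accesses
dense = scenario(131072)     # dense example: 131072 accesses
for name, costs in (("sparse", sparse), ("dense", dense)):
    print(name, costs)
print("break-even accesses:", break_even_accesses())
```

   With these constants the break-even point is 630 accesses to the
   4MB region: below it fallback (b) is cheaper, above it fallback (a)
   is cheaper.  Better latency numbers can be substituted for the
   constants at the top.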

=================================
2. If subjected to a long lasting workload, the kernel does not
appear to learn to improve the fallback.

   Workloads with heavy demand for large pages do not appear to encourage
   the operating system to increase the supply of local bigpages, even
   when the workload persists for days.  For example, a workload was run
   with 12 applications that made varying, but sometimes heavy, demand for
   4MB pages.  The largest of the 12 applications requests ~950 MB.  All
   the applications were compiled with -xpagesize=4M.

   The tested system had 18 locality groups, each with 16 GB, _except_ for
   the locality group with Solaris itself, which had 32GB.  Notice in this
   extract from 'lgrpinfo' that each leaf lgroup has about the same amount
   of memory allocated:

        Memory: installed 32768 Mb, allocated 8577 Mb, free 24191 Mb
        Memory: installed 16384 Mb, allocated 8307 Mb, free 8077 Mb
        Memory: installed 16384 Mb, allocated 8344 Mb, free 8040 Mb
        Memory: installed 16384 Mb, allocated 8286 Mb, free 8098 Mb
        Memory: installed 16384 Mb, allocated 8289 Mb, free 8095 Mb
        Memory: installed 16384 Mb, allocated 8267 Mb, free 8117 Mb
        Memory: installed 16384 Mb, allocated 8251 Mb, free 8133 Mb
        Memory: installed 16384 Mb, allocated 9293 Mb, free 7091 Mb
        Memory: installed 16384 Mb, allocated 8270 Mb, free 8114 Mb
        Memory: installed 16384 Mb, allocated 8228 Mb, free 8156 Mb
        Memory: installed 16384 Mb, allocated 8219 Mb, free 8165 Mb
        Memory: installed 16384 Mb, allocated 8185 Mb, free 8199 Mb
        Memory: installed 16384 Mb, allocated 8275 Mb, free 8109 Mb
        Memory: installed 16384 Mb, allocated 8208 Mb, free 8176 Mb
        Memory: installed 16384 Mb, allocated 8191 Mb, free 8193 Mb
        Memory: installed 16384 Mb, allocated 8277 Mb, free 8107 Mb
        Memory: installed 16384 Mb, allocated 8218 Mb, free 8166 Mb
        Memory: installed 16384 Mb, allocated 8217 Mb, free 8167 Mb
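
   For anyone who wants to repeat the comparison, here is a minimal
   sketch that parses "Memory:" lines in the format quoted above and
   reports free memory per line.  It assumes the lines look exactly
   like the extract; the function name is illustrative, and only three
   of the 18 lines are embedded as sample input.

```python
import re

# Three of the "Memory:" lines quoted above: the first lgroup, a typical
# lgroup, and the lgroup with the most memory allocated.
SAMPLE = """\
Memory: installed 32768 Mb, allocated 8577 Mb, free 24191 Mb
Memory: installed 16384 Mb, allocated 8307 Mb, free 8077 Mb
Memory: installed 16384 Mb, allocated 9293 Mb, free 7091 Mb
"""

LINE = re.compile(
    r"Memory: installed (\d+) Mb, allocated (\d+) Mb, free (\d+) Mb"
)


def parse_memory_lines(text):
    """Return one dict per matching line, with sizes in Mb as integers."""
    rows = []
    for match in LINE.finditer(text):
        installed, allocated, free = (int(g) for g in match.groups())
        rows.append({"installed": installed, "allocated": allocated, "free": free})
    return rows


rows = parse_memory_lines(SAMPLE)
for index, row in enumerate(rows):
    pct_free = 100.0 * row["free"] / row["installed"]
    print(f"line {index}: allocated {row['allocated']} Mb, "
          f"free {row['free']} Mb ({pct_free:.0f}% of installed)")
```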

   In each locality group, 8 user processes were run, EXCEPT in the first
   lgroup, where only 7 were run.  Thus the first lgroup is more lightly
   loaded, AND has plenty of free memory (24191 MB above).  Nevertheless,
   even after more than 24 hours, processes run in the first lgroup (and
   only processes run in the first lgroup) show page scattering.  For
   example:

       Address    Bytes Pgsz Mode   Lgrp Mapped File
      00800000    4096K   4M rwx--    9   [ heap ]
      00C00000    4096K   4M rwx--    3   [ heap ]
      01000000    4096K   4M rwx--   10   [ heap ]
      01400000    4096K   4M rwx--    7   [ heap ]
      01800000    4096K   4M rwx--   15   [ heap ]
      01C00000    4096K   4M rwx--    3   [ heap ]
      02000000    8192K   4M rwx--   12   [ heap ]
      02800000    4096K   4M rwx--   14   [ heap ]
      02C00000    4096K   4M rwx--    8   [ heap ]
      03000000    4096K   4M rwx--    4   [ heap ]
      03400000    4096K   4M rwx--    6   [ heap ]
      03800000    4096K   4M rwx--    4   [ heap ]
      03C00000    4096K   4M rwx--    9   [ heap ]
      04000000    4096K   4M rwx--   14   [ heap ]
      04400000    4096K   4M rwx--    5   [ heap ]
      04800000    8192K   4M rwx--   10   [ heap ]
      05000000    4096K   4M rwx--    8   [ heap ]
      05400000    4096K   4M rwx--    5   [ heap ]
      05800000    4096K   4M rwx--    8   [ heap ]
      05C00000    4096K   4M rwx--   14   [ heap ]
      06000000    4096K   4M rwx--   16   [ heap ]
      06400000    4096K   4M rwx--    6   [ heap ]
      06800000    4096K   4M rwx--    4   [ heap ]
      06C00000    4096K   4M rwx--   17   [ heap ]
      07000000    4096K   4M rwx--    4   [ heap ]
      07400000    4096K   4M rwx--    5   [ heap ]
        (continues for 200 more lines) 

   Emphasis: there is plenty of free memory in the first locality group.
   There is a processor that is almost entirely free.  Nevertheless, even
   after more than 24 hours, local bigpage allocations are not completed.
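
   The degree of scattering can be summarized by totalling mapped
   memory per lgroup.  The sketch below does this for rows in the
   format of the listing above; the function name is illustrative, the
   column layout is assumed to match the listing exactly, and only the
   first seven rows are embedded as sample input.

```python
from collections import Counter

# First rows of the listing quoted above.  Columns are
# Address, Bytes, Pgsz, Mode, Lgrp, Mapped File.
SAMPLE = """\
 Address    Bytes Pgsz Mode   Lgrp Mapped File
00800000    4096K   4M rwx--    9   [ heap ]
00C00000    4096K   4M rwx--    3   [ heap ]
01000000    4096K   4M rwx--   10   [ heap ]
01400000    4096K   4M rwx--    7   [ heap ]
01800000    4096K   4M rwx--   15   [ heap ]
01C00000    4096K   4M rwx--    3   [ heap ]
02000000    8192K   4M rwx--   12   [ heap ]
"""


def kbytes_per_lgroup(text):
    """Sum mapped kilobytes by lgroup ID for rows in the format above."""
    totals = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything not shaped like a data row.
        if len(fields) < 5 or not fields[1].endswith("K"):
            continue
        totals[int(fields[4])] += int(fields[1][:-1])
    return totals


totals = kbytes_per_lgroup(SAMPLE)
total_kb = sum(totals.values())
print(f"{total_kb} KB spread over {len(totals)} lgroups")
for lgroup, kb in sorted(totals.items()):
    print(f"  lgroup {lgroup:2d}: {kb} KB")
```

   A process whose pages were all placed in its home lgroup would show
   a single lgroup in this summary.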


=================================
3. Users do not have a way to teach it to make better fallback decisions.

   If the fallback decisions are not as intelligent as we might wish, and
   if the OS does not adjust to the load over time to make better fallback
   decisions, then can the user at least provide information to improve
   the decisions?  Not at this time.

   One could imagine tunables that would indicate a preference among
   fallback options (a), (b), and (c).  These preferences might be
   expressed as direction to the OS, or as mere guidance.  While many
   solutions might be considered, for the sake of specificity, here's a
   proposal for a pair of parameters.  The intent is that these be
   considered as a pair: both would be implemented.

   ========================
   numa_fallback_preference
   ========================

     Description

        Controls a heuristic for allocation of memory pages when the
        requested page size is not immediately available in the local
        memory group, but could be satisfied from a remote memory group.

        Lower values suggest that the operating system should give more
        weight to the requested size (potentially allocating the page
        in a remote memory locality group).  Higher values give more
        weight to location (potentially allocating a different size than
        requested).

        Note: when a program requests smaller pages than are immediately
        available in the current locality group, the operating system may
        be able to create them by dividing larger pages into smaller
        pieces.  Although page division is straightforward, page division
        may lead to page fragmentation; a better choice might be to use a
        small page in a remote locality group.  When
        numa_fallback_preference is decreased to lower values, page
        division becomes less likely; when numa_fallback_preference is
        increased, the probability of page division is increased.

     Default

        - 4 (Moderate preference to allocate pages locally, with
          moderately reduced priority for the requested size)

     Range

        - 1 (Strong preference to allocate pages with the requested size) to
        - 5 (Strong preference to allocate pages locally) 

     When to change

        This parameter may be decreased if the workload on the system
        tends to use large address spaces with relatively sparse access,
        because in this situation TLB misses will be more costly than
        remote latency.  

        This parameter may be increased if the workload tends to use
        smaller address spaces, or for programs that touch most of the
        memory that they allocate.  In such situations, the cost of a
        remote memory access may dominate the cost of TLB misses.

        TLB misses may be observed using the -T option to trapstat (1M).
        Page locations and sizes may be observed via the NUMA
        Observability Tools, available at opensolaris.org.


   ========================
   page_coalesce_priority
   ========================

     Description

        Controls a heuristic for the allocation of pages when the
        requested page size is not immediately available, but smaller
        pages are available.  

        When a program requests larger pages than are immediately
        available, it may not be possible to fulfill the request without
        substantial cost, such as moving smaller pages around to coalesce
        larger pages.  Page coalesce operations may be done either by a
        background thread at low priority, or at the time that a large
        page is requested.

     Default

        - 2 (a background thread runs at low priority to attempt to
          coalesce pages, but coalesce operations are not attempted at the
          time of a large page request)

     Range

        - 1 (Allocation of the requested size is considered relatively low
          priority.  Page coalesce operations are rarely attempted.)

        - 5 (Allocation of the requested size is considered a high
          priority.  Page coalesce operations are frequently attempted,
          including both by a background thread and at the time that page
          requests are made.) 

     When to change

        This parameter may be decreased when the workload consists
        primarily of programs with relatively small address spaces.  It
        may be increased if running programs with large address spaces,
        especially if these programs explicitly request large pages, and
        the requests often fail, resulting in increased TLB miss activity.

        TLB misses may be observed using the -T option to trapstat (1M).
        Page sizes may be observed via the -s option for pmap (1).

Business impact on an unannounced processor is discussed under comments.
Observations at a customer site will also be added under comments.
Work Around
N/A
Comments
N/A